Article

Hybrid ML-Based Technique to Classify Malicious Activity Using Log Data of Systems

1 Department of Information Systems, College of Computer and Information Sciences, King Saud University, P.O. Box 51178, Riyadh 11543, Saudi Arabia
2 Department of Computer Science and Engineering, College of Applied Studies and Community Services, King Saud University, P.O. Box 22459, Riyadh 11495, Saudi Arabia
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(4), 2707; https://doi.org/10.3390/app13042707
Submission received: 21 January 2023 / Revised: 14 February 2023 / Accepted: 17 February 2023 / Published: 20 February 2023
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

A computer system generates logs to record all relevant operational data about the system and all operations performed in it. System log examination is crucial in the identification of network- and system-level attacks. In comparison to established hazards, new technological advancements and greater connectivity pose a greater degree of risk. Several machine learning algorithms that continually monitor log data have been created in the past to defend the system against many threats. The majority of earlier anomaly detection methods need a priori knowledge and are not intended for spotting recent or impending threats. Additionally, the growing volume of logs creates fresh difficulties for anomaly identification. In this study, we developed a machine learning technique that identifies abnormalities in the system log with higher accuracy and efficiency. In our suggested strategy, we started with three log features, preprocessed them, and then derived more than 10 features for the model. We performed studies to test the effectiveness of our suggested approach, using criteria that included F1 scores, ROCs, accuracy, sensitivity, and specificity. We also evaluated how well our suggested technique performs in comparison to other methods. According to the experimental findings based on the dataset obtained from the Blue Gene/L supercomputer system, our suggested solution has a greater rate of anomaly identification than previously published algorithms.

1. Introduction

It is essential to dive into the foundations of log-based systems as the classification of malicious activities is performed using machine learning. Logs are run-time details about systems that are kept in the computer as plain text. This information may be used to examine system behavior for assisting developers and engineers with system development. It includes user behavior, as well as any operations that are carried out inside the computer system. Malicious activity is defined as user conduct that might damage the system, data, or network. Logs are essential for the creation, upkeep, and improvement of the fundamental architecture of software systems. Researchers have come up with novel mechanisms such as log-based anomaly detection for extracting features and detecting anomalies with higher performance and efficiency [1]. Typically, the system preserves comprehensive data in log files, enabling developers and system engineers to identify system behaviors and potential future issues [2]. Because log files provide extensive information, researchers may use them to assess system flaws that aid in diagnosing activities such as anomalous behavior, statistical analysis, application security, system performance identification, and crash and error diagnostics [3].
Anomalies in log data indicate probable system faults and are crucial for troubleshooting performance and application issues. The timing and location of an issue in an application may be determined using logs and timestamp information [4]. Large volumes of logs are frequently produced by modern software systems (in the case of a cloud-based program, around 1 GB of data each hour). A practical manual examination of log messages for essential diagnostic data is not feasible. The objective of this log analysis is to find unusual activities and possible security risks. The term “abnormal behavior” covers both attack-related activity and deviations from the typical behavior of the computer or network. In a typical standalone system, the administrator must personally analyze the system to find anomalies. These anomalies may be found using a code search or standard string matching, but human inspection is still necessary in many cases. For large-scale systems, these techniques are insufficient [5].
Over the last 10 years, academics, developers, and commercial providers have focused on anomaly detection technologies such as automated system log analysis to help the end-user. However, the majority of earlier detection systems used certain patterns and a priori information to find anomalies in the system log. These techniques are limited to recognized attacks. These algorithms are unable to handle all of the difficulties in contemporary systems. Figure 1 shows a summary of the log data related to a specific system used in this research work. The figure shows events of different applications that occurred in the system.
When analyzing system log data, unusual activity appears as distinct patterns or traits that do not follow the ordinary or customary actions observed throughout the complete operation. According to Grubbs, “outliers” are aberrant observations that seem to diverge considerably from other members of the sample in which they occur. Certain properties of log data make it difficult to find anomalies [4].
These difficulties include redundant runtime data, unstructured plain text, and large data imbalance. Anomalies themselves come in three different flavors: “point anomalies”, “contextual anomalies”, and “collective anomalies”.
a. Point Anomaly
A point anomaly refers to a data point that fully deviates from the normal or average distribution of all other data points [6]. The system also generates such data; the significant change is limited to certain data points, which still share some contextual information with the other points or with the average data. Data points O1 and O2 in Figure 2 stand for point anomalies. Point anomalies occur when the system experiences uncommon occurrences, such as bank fraud, unusual ATM card use, and unusual movement of automobiles on the road.
b. Contextual Anomaly
A contextual anomaly is a data point that is inconsistent in a certain context, even though it would be considered typical relative to the other data points [7]. Since identifying the data points where irregularities occur requires knowledge of their root causes, this type of anomaly, also known as a constrained anomaly or a conditional anomaly, is extremely difficult to detect.
c. Collective Anomaly
A collective anomaly is a collection of related data points that is anomalous as a whole [8,9]. The anomaly described here is a particular trait of log data. It should be emphasized that such anomalies are composed of several data points, each of which may not be aberrant on its own [10]. In Figure 3, a sample of a collective anomaly is depicted.

1.1. Research Objective

The proposed research paper has the following research objectives:
  • To comprehend the idea underlying anomaly detection in logs and the identification of dangerous behavior using machine learning algorithms, through a detailed background examination.
  • To provide a precise and practical method for handling imbalanced datasets in order to foresee anomalies.
  • To use nonlinear t-SNE (t-distributed stochastic neighbor embedding) for feature refinement and dimensionality reduction rather than PCA.
  • To handle an imbalanced dataset through the use of isolation forest, a classifier that outperforms k-means and the one-class support vector machine.
  • To perform exhaustive experiments to show that the recommended strategy is viable and to compare it to other recent state-of-the-art approaches.

1.2. Paper Organization

The paper comprises five sections. Section 1 introduces log-based systems and their potential in the identification of unwanted system behaviors. Section 2 gives a broad overview of anomaly-related principles, technologies, and algorithms, as well as advancements in this field, and it provides all the background needed for this paper. Section 3 provides an overview of the research approaches and their logical flow; it also describes the techniques used in the project. Section 4 goes through the experimental design, the metrics and conditions used to arrive at our estimates, and the outcomes of the models as implemented in the code. Section 5 contains the authors’ recommendations and final thoughts.

2. Literature Review

This section provides basic background information about anomalies in system logs, as well as traditional anomaly detection methods, summarizes key machine learning challenges, and explains the algorithms used in system log anomaly detection.

2.1. General Anomaly Detection

The term “anomalies” refers to system behavior that is “strange, peculiar, unusual, or difficult to categorize” [11]. Within the context of data science, these anomalies can be recognized (see Figure 4 for an example). Anomaly detection draws on machine learning methods such as classification, clustering, and regression and is regarded as a branch of science in its own right. Anomalies can happen due to different causes, such as unanticipated system modifications, system attacks, changes to system log data, and in-transit data alterations from remote sensors [11].
Anomaly detection can be represented mathematically as a mapping $F$ whose domain is the input space $E$ and whose codomain is the detection output. The detection output is employed to determine whether an entity of input data is anomalous or normal:
$$F : E \rightarrow F,$$
where $E$ is the input space and $F$ is the output space.
The proposed application, observer, and other external variables have a large role in determining whether an instance of input data is an abnormality or not [12]. Thus, rather than providing a precise mathematical definition of an anomaly, we simply supplied a broad conceptualization of the term previously in this section.

2.2. Types of ML-Based Anomaly Detection Methods

Machine learning (ML) is a subset of AI that focuses on structuring computational systems to learn and reason like humans. Initially, the system is trained so that the algorithm can learn to make sense of the data and the labels that have been assigned to them. The system’s ultimate objective is to make judgments on its own (much like people) during the later testing phase. On the basis of the data examined and the conclusions reached by the system, ML operates with a given degree of probability. The heart of machine learning is prediction: the capacity to foresee future occurrences on the basis of the past. Therefore, machine learning is crucial to the identification of system log anomalies. Our models were trained on the recorded datasets, and the testing outcomes were then examined. Machine learning uses a variety of models, and the best models are chosen according to the challenges at hand and the best practices for those models. Each method’s benefits and drawbacks are briefly examined below.

2.2.1. Supervised Anomaly Detection

When training algorithms, fully supervised approaches to anomaly detection make use of labeled data. Supervised anomaly detection aims to combine two types of domain expertise: knowledge of normal behavior and knowledge of anomalous behavior.
The supervised anomaly detection model is shown in Figure 5. Test data are used to evaluate the correctness of the model.
Two categories of algorithms are used in supervised machine learning.
Classification: Each item’s major characteristics must be identified so that the approach yields the correct class. Extensive studies on this topic may be found in the fields of psychology and computer science. Classification may be categorized as either binary or multiclass. In binary classification, just two outputs are possible, for instance, determining whether log data are typical or unusual, or whether or not an email is spam. Multiclass classification, in contrast, can be illustrated by determining a website’s language, which could be any of several languages, such as Spanish, English, or French.
Regression: Continuous-response value predictions are made using this sort of algorithm. Such an algorithm may estimate the price of a property in the range of 100,000–2,000,000 USD on the basis of factors such as size, location, length, and number of rooms, or the pay of an individual on the basis of factors such as degree, work experience, scale, and city.

2.2.2. Unsupervised Anomaly Detection

Unsupervised anomaly detection is used when we are unsure whether data points are inliers or outliers. These algorithms are the most adaptable and do not need target labels. Unsupervised anomaly detection methodologies assign each data point a score according to the dataset’s attributes. Unsupervised outlier identification is shown in Figure 6; in this method, a model is trained using unlabeled data and produces experimental findings. Some of this approach’s disadvantages are as follows:
  • Difficulty distinguishing noise from outliers.
  • High cost of the initial clustering, even though outliers are far fewer than ordinary items.
  • It cannot provide precise information as output because no labels are given, so the result is less accurate.

2.2.3. Semi-Supervised Anomaly Detection

Data with incomplete labeling may be handled using semi-supervised techniques [13]. Label acquisition is often difficult and calls for the assistance of human domain specialists [14]. As a result, unsupervised techniques are often preferred over supervised techniques. For problems in which the dataset is only partially labeled, semi-supervised techniques are suitable candidates.

2.2.4. Reinforcement-Based Anomaly Detection

In the field of machine learning issues, reinforcement learning has an intriguing role. It employs a method of improvement and feedback. Reinforcement learning employs agents that learn from outcomes rather than being explicitly instructed [15]. These agents choose new actions on the basis of algorithmic experience and are rewarded for successful results. The three primary characteristics of reinforcement learning agents are as follows:
  • Optimal control.
  • Learning through trial and error, much as animals do.
  • Bias resistance: supervised learning picks up inherited bias if bias is present in the labeled data, whereas reinforcement learning does not depend on such labels. As a result, reinforcement learning is better suited to providing answers that are free of prejudice.
Following a review of the literature, Table 1 provides a summary of prior research publications for comparison with our suggested approach and current algorithms. It is crucial to understand how earlier research was conducted, as well as the kinds of datasets and techniques used. Researchers should focus on the shortcomings of current algorithms to fill any gaps and make improvements. We briefly list the methodology of each publication, the dataset that was utilized, the accuracy, and the research’s limitations.

3. Proposed Method

By proposing a detailed method, this section tackles the limitations of conventional anomaly detection systems, which spend the bulk of their time creating models for the whole dataset, by adopting a machine learning method that classifies data points using the profile of anomalous occurrences rather than regular cases. The five steps of the recommended method are briefly detailed below.

The Process Outline of the New Approach

The suggested technique is based on five steps: “collecting” logs, “preprocessing” the logs, “dimensionality reduction”, “selecting” a model, and “evaluating”. The steps involved in the suggested technique are shown in Figure 7. A quick rundown of what should happen at each stage is provided below.
(a) Data Collection/Log Collection: To serve as a standard benchmark for anomaly detection methods, the BGL (Blue Gene/L supercomputer) dataset was acquired. This labeled dataset has 4,747,963 log messages and 348,460 anomalies and may be used in the training of these algorithms. This dataset, generated by data mining specialists, may now be used to test the performance of algorithms on various tasks [32].
(b) Preprocessing: The data preparation process is essential in machine learning. The methods utilized to gather data do not, by themselves, provide features that are accurate and useful for feeding into machine learning programs, and a model given raw data without preprocessing may produce misleading results. For this reason, it is crucial to put the data in the appropriate format before training, so that algorithms may operate on the data properly without undesirable repercussions. Data preparation includes several steps, such as instance selection, normalization, dimension reduction, and feature selection. Investing in these time-consuming procedures of data preparation and data filtering is crucial to obtaining reliable findings.
(c) Dimensionality Reduction: Datasets that include a huge number of characteristics hinder the comprehension or exploration of the relationships among them. The exploratory data analysis (EDA) phase might drag on and perhaps affect the efficacy of the ML models being used. As a machine learning technique for sparse representation and visualization, we used the t-distributed stochastic neighbor embedding (t-SNE) method developed by Laurens van der Maaten and Geoffrey Hinton [33].
(d) Model selection: We next used the machine learning methods K-means, one-class SVM, and isolation forest after the data were preprocessed. The first step in choosing a model is to analyze the dataset in depth to discover the types of features present. Machine learning algorithm selection is the process of selecting the best machine learning algorithms from a set of candidates. K-means, a one-class support vector machine, and isolation forest were the three methods we settled on.
(e) Performance Evaluation: Machine learning relies heavily on performance evaluations. The goal of a performance evaluation is to find out how well a model or strategy works with extended data. To evaluate the efficacy of the model, we used numerous metrics, including F1-score, sensitivity, and specificity. Performance evaluation is the method used to determine the best terms to use when rating the effectiveness of a machine learning system. After learning everything there is to know about the models and experimental datasets, the next step is to evaluate their performance. In this study, we discuss the models and dataset we used in terms of performance evaluation terminology, such as F1 measure, sensitivity, specificity, and receiver operating characteristics (ROCs).
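As a rough illustration of how these five stages fit together, the sketch below wires them up with scikit-learn. The file name, column names, and parameter values are placeholders rather than the paper's exact implementation.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE
from sklearn.ensemble import IsolationForest
from sklearn.metrics import f1_score, roc_auc_score

# (a)+(b) Collected and preprocessed BGL features; "label" marks known anomalies.
features = pd.read_csv("bgl_features.csv")
X = StandardScaler().fit_transform(features.drop(columns=["label"]))
y = features["label"].values

# (c) Dimensionality reduction with t-SNE.
X_2d = TSNE(n_components=2, random_state=0).fit_transform(X)

# (d) Model selection: isolation forest, the paper's best-performing choice.
model = IsolationForest(n_estimators=200, contamination=0.025, random_state=100)
pred = (model.fit_predict(X_2d) == -1).astype(int)   # -1 means isolated -> anomaly

# (e) Performance evaluation.
print("F1 :", f1_score(y, pred))
print("AUC:", roc_auc_score(y, -model.score_samples(X_2d)))
```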

4. Experiments

The purpose of this section is to evaluate how well the proposed method for detecting anomalies works. The experimental design, including the algorithms, performance measurements, datasets, and experimental framework, is laid forth in the first subsection. Next, we compare and contrast our results with those obtained using previously mentioned methodologies, all based on our testing of the proposed anomaly detection method.
The discussion and the selection of performance metrics are followed by a description of the datasets used for testing, and finally a definition of the circumstances in which the detection systems are assessed.

4.1. Datasets

Our study’s core objective was to identify anomalies in system logs; hence, we chose a dataset that supports this central hypothesis. For the aim of evaluating our suggested approaches, we picked a representative benchmark dataset. The information on the dataset is presented in Table 2.

4.2. Performance Metrics

In order to gauge the efficacy of an algorithm, performance assessment terms play a crucial role in machine learning approaches. The performance of algorithms may be measured in a variety of ways, and each concept has problem-specific uses. Here, TP represents anomalous data points correctly identified as anomalies, and FP denotes normal data points incorrectly identified as anomalies. FN denotes anomalous data points incorrectly identified as normal, and TN represents data points correctly identified as normal. We selected the ideal time frame for testing the efficacy of our suggested approach to anomaly identification. In this study, the confusion matrix is defined before any specific measure is discussed. For example, a 2 × 2 confusion matrix is shown below.
$$\begin{bmatrix} \mathrm{TP} & \mathrm{FP} \\ \mathrm{FN} & \mathrm{TN} \end{bmatrix}$$
where TP is true positive, FP is false positive, FN is false negative, and TN is true negative.
Algorithm performance is measured in a variety of ways. The choice of techniques, data collection, and attribute type all influence the vocabulary used. The term “accuracy” is first defined.
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
Due to its imbalance, this dataset makes accuracy alone unsuitable for judging anomaly detection. The number of anomalous occurrences is substantially smaller than the number of normal instances, such that TN + FN ≈ TN and hence ACC_null ≈ 1. Even a trivial anomaly detector that tags every data point as normal (TP = 0) would have extremely high accuracy.
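For instance, using the counts in Table 2, a null detector that labels all 4,747,963 BGL messages as normal would still achieve an accuracy of (4,747,963 − 348,460)/4,747,963 ≈ 0.93 while detecting no anomalies at all.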
However, the metrics of recall (also known as sensitivity or real positive rate) and false positive rate (FPR) are very significant since they indicate the features of a reliable detection system.
$$\text{Recall} = \frac{TP}{TP + FN}$$
$$\text{False positive rate} = \frac{FP}{FP + TN}$$
The F1-score combines two different metrics, recall and precision, and is defined as their harmonic mean.
$$F1\text{-}score = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
F1-score values vary from zero to one, where the latter indicates perfect precision and recall, and the former represents zero precision and recall.
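As a concrete illustration, the short sketch below computes these terms from a 2 × 2 confusion matrix; the counts are invented for illustration and are not taken from the experiments.

```python
# Illustrative confusion-matrix counts (not from the paper's experiments).
tp, fp, fn, tn = 320_000, 35_000, 28_460, 4_364_503

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)                  # sensitivity / true positive rate
fpr = fp / (fp + tn)                     # false positive rate
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)
specificity = tn / (tn + fp)

print(f"accuracy={accuracy:.3f}  recall={recall:.3f}  fpr={fpr:.4f}  "
      f"f1={f1:.3f}  specificity={specificity:.3f}")
```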

4.3. Data Preprocessing

Sources of raw data are compiled for use in machine learning, and this information does not follow any strict standard. For machine learning models to make sense of it, the data must undergo a procedure known as “data preparation”, in which several algorithms are used, each tailored to a particular task. Missing values, outliers, inconsistent values, restricted or sparse features, and feature engineering demands are all addressed during the preparation stage of data analysis; without data pretreatment, models produce unwanted outcomes. A total of 253,135 rows of the data collection were utilized in this paper. Table 3 displays the raw data before the preprocessing steps were applied for anomaly detection. Log ID, timestamp, and IP address were separated into three columns, and a feature-building approach was employed to extract useful data from these columns for anomaly detection.
We extracted valuable information from the IP address and timestamp columns. As shown in Table 4, the timestamp column facilitated the extraction of per-IP, date-wise features such as total count, daily counts, weekend ratio, time difference mean, and time difference max. Table 5 shows the values of features built from the current timestamp.
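The sketch below shows one way these features could be derived with pandas from the raw three-column layout of Table 3. The exact definitions of `daily_counts` and `is_weekend_ratio` are assumptions made for this sketch, since the paper does not spell them out.

```python
import pandas as pd

# Raw log table with the three columns of Table 3: @timestamp, IP, Log_ID.
logs = pd.read_csv("raw_logs.csv", parse_dates=["@timestamp"])
logs = logs.sort_values(["IP", "@timestamp"])

# Timestamp-derived columns (Table 5): previous event time per IP, gap in
# minutes, and calendar fields.
logs["shift_time"] = logs.groupby("IP")["@timestamp"].shift(1)
logs["time_diff"] = (logs["@timestamp"] - logs["shift_time"]).dt.total_seconds() / 60
logs["date"] = logs["@timestamp"].dt.date
logs["dow"] = logs["@timestamp"].dt.dayofweek
logs["hour"] = logs["@timestamp"].dt.hour

# Per-IP aggregates (Table 4). The definitions of daily_counts (number of
# active days) and is_weekend_ratio (weekday-to-weekend event ratio) are
# assumptions, not the paper's stated formulas.
per_ip = logs.groupby("IP").agg(
    total_count=("Log_ID", "count"),
    daily_counts=("date", "nunique"),
    is_weekend_ratio=("dow", lambda d: (d < 5).sum() / max((d >= 5).sum(), 1)),
    td_mean=("time_diff", "mean"),
    td_max=("time_diff", "max"),
).fillna(0)
```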
The most crucial step after preprocessing is to decrease the dimension of the data so that algorithms can be trained and tested more quickly. To do this, we used the t-SNE technique, which is a recent and effective method for data compression and visualization.

Dimensionality Reduction Using t-SNE (t-Distributed Stochastic Neighbor Embedding)

T-distributed stochastic neighbor embedding (t-SNE) was created by Geoffrey Hinton and Laurens van der Maaten. Probabilistic embedding based on the neighborhood t-distribution allows high-dimensional data to be represented in 2D or 3D space. t-SNE, a nonlinear method, may convert enormous datasets into 2D or 3D, with distance indicating differences and closeness showing similarities. Unlike other algorithms, t-SNE retains both global and local data structures [34]. Figure 8 shows six clusters of data points with similar characteristics; these six clusters represent the dataset. The t-distributed stochastic neighbor embedding method then reduced the six clusters to two, where red dots represent extreme values and blue dots represent normal values. Figure 9 shows the findings.
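A minimal sketch of this reduction step with scikit-learn follows; `per_ip` is the feature table from the preceding sketch, and the perplexity value is an assumption.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Scale the engineered features, then embed them in two dimensions.
X_scaled = StandardScaler().fit_transform(per_ip)
embedding = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X_scaled)

# A scatter plot of the embedding gives the kind of view shown in Figures 8 and 9.
plt.scatter(embedding[:, 0], embedding[:, 1], s=5)
plt.title("t-SNE projection of log features")
plt.show()
```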

4.4. Experimental Setup

To evaluate the effectiveness of the suggested strategy, three experimental settings were developed. We worked with an imbalanced dataset in our experimental setup, which is typical in anomaly detection. We utilized the k-means method in the first trial since it works well with data from many applications. A one-class support vector machine was set up in the second trial because it is well suited to a dataset with distinct distributions of anomaly and non-anomaly data points. Isolation forest, introduced in 2008, was applied in the third trial, making use of profiling of anomalous cases throughout the training phase. Below, each of these algorithms is discussed in full, along with how well it performs.

4.4.1. Experimental Setup 1: K-Means Clustering Algorithm (Unsupervised Machine Learning)

This is a strategy for dealing with unsupervised learning problems by constructing various clusters of abnormal and normal data points. The primary task of this method is to divide the input data into a set of clusters; the number of clusters produced is the central idea, and it is represented by the letter “k”. These clusters need proper configuration for optimal performance, and the algorithm performs best if the groups are as far apart as possible. As a second step, we take all of the data points and determine how far each is from the cluster centers; using a similarity measure between each data point and the cluster centers obtained in the previous phase, every point is assigned to its nearest center. The cluster centers are then recomputed, and a new distance is determined between each data point and the moved centers. If no assignments change after this step, the procedure is complete.
A loop is created to control the correct number of clusters. Using that loop, we can see that each center gradually changes its position until the center of its cluster stops moving. In k-means, it is important to determine the initial number of clusters. Different methods are used to detect the initial number of clusters; in this study, we used a Python built-in function, as shown in Figure 10.
The y-axis in Figure 10 represents the number of probable clusters used in identifying the malicious behavior. There was no change in the shape of the curve after the sixth value, so six clusters were selected. Figure 11 depicts the first set of clusters, which consists of six nodes.
The six identified clusters should be classified into two clusters: anomaly and non-anomaly. Thus, t-SNE was applied to the identified clusters to convert them into these two categories. The scatter plot results are shown in Figure 12, and the results of k-means are given in Table 6. It should be noted that these results were obtained after applying all preprocessing steps.
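A compact sketch of this setup with scikit-learn is given below: an inertia (elbow) loop to pick the number of clusters, the final six-cluster fit on the t-SNE embedding, and an illustrative rule that flags the smallest clusters as anomalous. The 5% size threshold is an assumption, not the paper's rule.

```python
import numpy as np
from sklearn.cluster import KMeans

# Elbow-style search: the inertia curve flattens around k = 6 (cf. Figure 10).
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(embedding).inertia_
            for k in range(1, 11)]

# Final fit with six clusters on the t-SNE embedding from the earlier sketch.
kmeans = KMeans(n_clusters=6, n_init=10, random_state=0).fit(embedding)

# Illustrative anomaly rule: clusters holding < 5% of the points are flagged.
sizes = np.bincount(kmeans.labels_)
rare_clusters = np.where(sizes < 0.05 * len(embedding))[0]
pred = np.isin(kmeans.labels_, rare_clusters).astype(int)   # 1 = anomaly
```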

4.4.2. Experimental Setup 2: One-Class Support Vector Machines

Historically, most classification work has addressed situations involving multiple classes: machine learning applications are designed to use training data to separate test data into several groups. What happens when there is just one class of data, and the objective is to compare brand-new data to the training data? One-class support vector machines, a common approach to this problem, have been around for about 20 years [35].
Different important parameters are used by the one-class support vector machine method to categorize data points; these critical parameters are detailed in Table 7.
These are the parameters used in support vector machine; for any algorithm, it is important to choose the best parameter values for the best performance [36,37]. These values are set according to the dataset, depending on the algorithm; we set the value of key parameters as shown in Table 8.
The results of the one-class support vector machine are given in Table 9. This result was achieved by tuning the parameters, which differ from one dataset to another, consistent with previous research [38]. We only present the values that best fit our algorithm; researchers can change these values according to the condition of the dataset and the specific application of the data. We compared these results with the results of existing algorithms.
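A sketch of how the Table 8 settings map onto scikit-learn's OneClassSVM follows; the `embedding` matrix carried over from the earlier t-SNE sketch is an assumption about what the model was fit on.

```python
from sklearn.svm import OneClassSVM

# One-class SVM configured with the parameter values of Table 8; nu roughly
# corresponds to the expected fraction of anomalies.
ocsvm = OneClassSVM(kernel="rbf", nu=0.025, gamma="auto", degree=2, coef0=0.0,
                    tol=0.001, shrinking=True, cache_size=200,
                    max_iter=-1, verbose=False)
ocsvm.fit(embedding)                                      # t-SNE features from the earlier sketch
svm_pred = (ocsvm.predict(embedding) == -1).astype(int)   # -1 -> flagged as anomaly
```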

4.4.3. Experimental Setup 3: Isolation Forest

This is an unsupervised algorithm used for classification and anomaly detection problems; it performs very well for a dataset with an unequal number of classes. It identifies anomalies by separating the data points using randomly selected features and random split values between the maximum and minimum values of the selected feature. It works like a decision tree algorithm, using rules that are generated in the training phase, and it enables the separation of abnormal data points because they have shorter path lengths.
Normally, to detect anomalies, we create a profile of normal data points and identify those points that do not follow this profile; this is a complex and computationally costly task. Isolation forest does not rely on defining normal behavior or on measuring distances between data points. As the word isolation suggests, it works by isolating the anomalies explicitly.
Isolation forest operates on the premise that anomalies are few and have distinguishable attribute values. In isolation forests, anomalies are isolated using a collection of isolation trees built over the data points; various combinations of isolation trees are employed for profiling anomalous data points in a collection of data points. We do not need to increase the depth of the trees in an isolation forest, and we may decrease the computational expense and memory demand, since abnormal data points have significantly shorter tree paths than regular data points [36]. For decision making with regard to malicious activity, a score is computed and data points are classified into anomaly and non-anomaly [37]. Table 10 shows the key parameters used for the isolation forest. Each observation in the data collection is given an anomaly score; the categories of anomaly scores are as follows:
  • An observation is an anomaly if its score is close to 1.
  • An observation is not an anomaly if its score is much smaller than 0.5.
  • If all scores are close to 0.5, no distinct anomaly is indicated.
The practical parameter values used in the implementation of our experiments are shown in Table 11.
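A sketch of this configuration with scikit-learn's IsolationForest is shown below, using the Table 11 values. Fitting on the engineered feature matrix (`X_scaled` from the earlier sketches) rather than the 2-D embedding is an assumption, made because max_features is set to 3.

```python
from sklearn.ensemble import IsolationForest

# Isolation forest configured with the parameter values of Table 11.
iforest = IsolationForest(n_estimators=200, max_samples="auto",
                          contamination=0.025, max_features=3,
                          bootstrap=False, n_jobs=1, random_state=100, verbose=0)
iforest.fit(X_scaled)

# predict() returns -1 for isolated (anomalous) points and 1 for normal ones.
if_pred = (iforest.predict(X_scaled) == -1).astype(int)

# score_samples() is the negative of the original iForest anomaly score, so
# negating it recovers a score where values near 1 indicate anomalies and
# values near 0.5 or below indicate normal points.
anomaly_score = -iforest.score_samples(X_scaled)
```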
The isolation forest results are shown in Table 12.

4.5. Summary of Results

The overall execution of the selected algorithms is shown in Table 13.
In our tests, we assessed the performance of the one-class support vector machine, isolation forest, and unsupervised k-means in anomaly detection. All of the methods were effective; however, our suggested approach, isolation forest, had the best accuracy. Using the same dataset, we then compared the outcomes of our suggested method with those of current models. We also discuss the parameters of the algorithms implemented by other researchers; the results are summarized in Table 14.
Table 14 compares the outcomes of our suggested strategy with currently used machine learning techniques, whose accuracy covers a wide range. Table 14 shows that the invariant miner approach performed well (F-measure: 0.91, recall: 0.99, and precision: 0.83). All techniques were compared to our suggested approach to assess its accuracy.

5. Conclusions and Future Recommendations

As the world grows more and more log data-driven, the challenge of log data anomaly detection becomes crucial in many application domains. There is no standard way to detect log data anomalies, and, as the volume of log data rises, accuracy degrades and the computation becomes more complicated. Finding anomalies in system logs is a research topic in the area of machine learning, and this article provided a detailed discussion of the relevant machine learning techniques. This study also discussed techniques for efficiently locating outliers in massive datasets. Improving the performance–generalization tradeoff for anomaly detection in system log data is the main contribution.
This research study discussed several log-based anomaly detection research techniques and their applications in many fields. The difficulties in detecting system log anomalies are discussed in this article, along with some current solutions. Each category of system log anomaly detection strategies was described in terms of many algorithms, along with their strengths and limitations. We also provided a short overview of each method now in use so that each method’s role in anomaly identification could be understood. This article’s goal was to examine several anomaly detection models for anomaly detection, evaluate their shortcomings, and determine whether they are appropriate for a given dataset.
By looking at anomalies and anomaly detection techniques, we examined queries from many researchers and provided answers in this work. We also provided a summary of the shortcomings of the available algorithms. We looked at other techniques and creative approaches for finding abnormalities in log data. We investigated models such as unsupervised k-means, one-class SVM, and isolation forest, and we compared their precision and computational expense to available machine learning techniques. These tests and analyses showed that, on the BG/L dataset, the suggested technique outperformed the other algorithms with an F1 score of 0.94, accuracy of 0.99, ROC curve of 0.99, sensitivity of 1.00, and specificity of 0.90.
Future research will mainly focus on maintaining the generalization of algorithms utilizing machine learning approaches as log data grow in size. Beyond this, our primary goal for future work is to develop a setup that can assess each algorithm’s performance in terms of accuracy, generalization, and real efficacy for various system log datasets. In this field of research, our main goal is to reduce the attack vulnerabilities of computer systems.
Future studies will also include testing the suggested approach against other datasets to assess its accuracy and computing efficiency. More complicated system logs will be used throughout the training and testing phases to ensure that the algorithm performs better in the real world. The creation of automated log anomaly detectors that can employ a deep learning method to identify abnormalities and categorize root causes across many classes must also be planned in future investigations.

Author Contributions

Conceptualization, A.A. (Aseel Alhadlaq); Methodology, A.M.M., A.A. (Abeer Alnuaim) and A.A. (Alaa Altheneyan); Formal analysis, A.A. (Abeer Alnuaim); Resources, A.A. (Aseel Alhadlaq); Data curation, A.A. (Abeer Alnuaim) and A.A. (Alaa Altheneyan); Writing—original draft, A.M.M.; Writing—review & editing, A.A. (Abeer Alnuaim). All authors have read and agreed to the published version of the manuscript.

Funding

The authors acknowledge the Researchers Supporting Project (RSPD2023R796), King Saud University, Riyadh, Saudi Arabia.

Institutional Review Board Statement

This study did not require ethical approval.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The authors acknowledge the Researchers Supporting Project (RSPD2023R796), King Saud University, Riyadh, Saudi Arabia.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Qi, J.; Luan, Z.; Huang, S.; Wang, Y.; Fung, C.; Yang, H.; Qian, D. Adanomaly: Adaptive Anomaly Detection for System Logs with Adversarial Learning. In Proceedings of the NOMS 2022–2022 IEEE/IFIP Network Operations and Management Symposium, Budapest, Hungary, 25–29 April 2022; pp. 1–5. [Google Scholar]
  2. Suthishni, D.N.P.; Kumar, K.S. A Review on Machine Learning based Security Approaches in Intrusion Detection System. In Proceedings of the 2022 9th International Conference on Computing for Sustainable Global Development (INDIACom), New Delhi, India, 23–25 March 2022; pp. 341–348. [Google Scholar]
  3. Jose, J.M.; Reeja, S.R. Anomaly Detection on System Generated Logs—A Survey Study. In Mobile Computing and Sustainable Informatics; Springer: Singapore, 2022; pp. 779–793. [Google Scholar]
  4. Fang, W.; Tan, X.; Wilbur, D. Application of intrusion detection technology in network safety based on machine learning. Saf. Sci. 2020, 124, 104604. [Google Scholar] [CrossRef]
  5. Karimipour, H.; Dehghantanha, A.; Parizi, R.M.; Choo, K.K.R.; Leung, H. A deep and scalable unsupervised machine learning system for cyber-attack detection in large-scale smart grids. IEEE Access 2019, 7, 80778–80788. [Google Scholar] [CrossRef]
  6. Nicholas, D.; Huntington, P.; Homewood, J. Assessing used content across five digital health information services using transaction log files. J. Inf. Sci. 2003, 29, 499–515. [Google Scholar] [CrossRef] [Green Version]
  7. Henriques, J.; Caldeira, F.; Cruz, T.; Simões, P. Combining k-means and xgboost models for anomaly detection using log datasets. Electronics 2020, 9, 1164. [Google Scholar] [CrossRef]
  8. Foorthuis, R. On the nature and types of anomalies: A review of deviations in data. Int. J. Data Sci. Anal. 2021, 12, 297–331. [Google Scholar] [CrossRef]
  9. Ahmed, M.; Pathan, A.S.K. Deep learning for collective anomaly detection. Int. J. Comput. Sci. Eng. 2020, 21, 137–145. [Google Scholar] [CrossRef]
  10. Maschler, B.; Pham, T.T.H.; Weyrich, M. Regularization-based Continual Learning for Anomaly Detection in Discrete Manufacturing. Procedia CIRP 2021, 104, 452–457. [Google Scholar] [CrossRef]
  11. Crespo Márquez, A. Techniques for Anomalies Detection. In Digital Maintenance Management; Springer: Cham, Switzerland, 2022; pp. 117–132. [Google Scholar]
  12. Yahaya, S.W.; Lotfi, A.; Mahmud, M. A Consensus Novelty Detection Ensemble Approach for Anomaly Detection in Activities of Daily Living. Appl. Soft Comput. 2019, 83, 105613. [Google Scholar] [CrossRef]
  13. Villa-Pérez, M.E.; Álvarez-Carmona, M.; Loyola-González, O.; Medina-Pérez, M.A.; Velazco-Rossell, J.C.; Choo, K.-K.R. Semi-supervised anomaly detection algorithms: A comparative summary and future research directions. Knowledge-Based Syst. 2021, 218, 106878. [Google Scholar] [CrossRef]
  14. Vanhoeyveld, J.; Martens, D.; Peeters, B. Value-added tax fraud detection with scalable anomaly detection techniques. Appl. Soft Comput. 2020, 86, 105895. [Google Scholar] [CrossRef]
  15. Kulkarni, P.S.; Stranieri, A.; Mahableshwarkar, A.; Kulkarni, M. Deep Reinforcement-Based Conversational AI Agent in Healthcare System. In Next Generation Healthcare Informatics; Springer: Singapore, 2022; pp. 233–249. [Google Scholar]
  16. Goernitz, N.; Kloft, M.; Rieck, K.; Brefeld, U. Toward Supervised Anomaly Detection. J. Artif. Intell. Res. 2013, 46, 235–262. [Google Scholar] [CrossRef] [Green Version]
  17. Merrill, N.; Eskandarian, A. Modified Autoencoder Training and Scoring for Robust Unsupervised Anomaly Detection in Deep Learning. IEEE Access 2020, 8, 101824–101833. [Google Scholar] [CrossRef]
  18. Kumar, V.; Maheshwari, R.; Akhtar, S. Energy Efficient Wireless Sensor Networks using Co-operative MIMO: A Technical Review. Int. J. Comput. Appl. 2016, 135, 20–27. [Google Scholar] [CrossRef]
  19. Liu, Z.; Qin, T.; Guan, X.; Jiang, H.; Wang, C. An Integrated Method for Anomaly Detection From Massive System Logs. IEEE Access 2018, 6, 30602–30611. [Google Scholar] [CrossRef]
  20. Chen, M.; Zheng, A.; Lloyd, J.; Jordan, M.; Brewer, E. Failure diagnosis using decision trees. In Proceedings of the International Conference on Autonomic Computing, New York, NY, USA, 17–18 May 2004; pp. 36–43. [Google Scholar] [CrossRef]
  21. Manoj, V.V.R.; Narayana, V.A.R.; Bhargavi, A. Outlier Detection using Reverse Nearest Neighbor for Unsupervised Data. Int. J. Trend Sci. Res. Dev. 2018, 2, 1511–1513. [Google Scholar] [CrossRef] [Green Version]
  22. Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2018; Volume 11125.
  23. Astekin, M.; Zengin, H.; Sozer, H. Evaluation of Distributed Machine Learning Algorithms for Anomaly Detection from Large-Scale System Logs: A Case Study. In Proceedings of the 2018 IEEE International Conference on Big Data (Big Data), Seattle, WA, USA, 10–13 December 2018; pp. 2071–2077. [Google Scholar] [CrossRef]
  24. Zhu, G.; Liao, J. 2008 Research of Intrusion Detection Based on Support Vector Machine. In Proceedings of the International Conference on Advanced Computer Theory and Engineering, Phuket, Thailand, 20–22 December 2008; pp. 434–438. [Google Scholar] [CrossRef]
  25. Ren, R.; Cheng, J.; Yin, Y.; Zhan, J.; Wang, L.; Li, J.; Luo, C. Deep Convolutional Neural Networks for Log Event Classification on Distributed Cluster Systems. In Proceedings of the IEEE International Conference on Big Data (Big Data), Seattle, WA, USA, 10–13 December 2018. [Google Scholar] [CrossRef]
  26. Wen, L. Research on System Design and Implementation of Computer Forensics Based on Log. In Proceedings of the International Conference on Computer Technology, Electronics and Communication (ICCTEC), Dalian, China, 19–21 December 2017; pp. 388–391. [Google Scholar] [CrossRef]
  27. V, D. A Study on Log Parser Analysis and Error Detection using Big Data. Int. J. Res. Appl. Sci. Eng. Technol. 2018, 6, 1584–1586. [Google Scholar] [CrossRef]
  28. Breier, J.; Branišová, J. A Dynamic Rule Creation Based Anomaly Detection Method for Identifying Security Breaches in Log Records. Wirel. Pers. Commun. 2015, 94, 497–511. [Google Scholar] [CrossRef]
  29. ElMenshawy, D.; Helmy, W.; El-Tazi, N. A Clustering based Approach for Contextual Anomaly Detection in Internet of Things. J. Comput. Sci. 2019, 15, 1195–1202. [Google Scholar] [CrossRef] [Green Version]
  30. Agrawal, S.; Agrawal, J. Survey on Anomaly Detection using Data Mining Techniques. Procedia Comput. Sci. 2015, 60, 708–713. [Google Scholar] [CrossRef] [Green Version]
  31. He, S.; Zhu, J.; He, P.; Lyu, M.R. Experience Report: System Log Analysis for Anomaly Detection. In Proceedings of the IEEE 27th International Symposium on Software Reliability Engineering (ISSRE), Ottawa, ON, Canada, 30–31 July 2019; pp. 207–218. [Google Scholar] [CrossRef]
  32. Battineni, G.; Chintalapudi, N.; Amenta, F. Machine learning in medicine: Performance calculation of dementia prediction by support vector machines (SVM). Informatics Med. Unlocked 2019, 16, 100200. [Google Scholar] [CrossRef]
  33. Aussel, N.; Petetin, Y.; Chabridon, S. An evaluation study on log parsing and its use in log mining. In Proceedings of the 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, Toulouse, France, 28 June–1 July 2016. [Google Scholar] [CrossRef]
  34. Fu, Q.; Lou, J.-G.; Wang, Y.; Li, J. Execution anomaly detection in distributed systems through unstructured log analysis. In Proceedings of the 2009 Ninth IEEE International Conference on Data Mining, Miami, FL, USA, 6–9 December 2009; pp. 149–158. [Google Scholar] [CrossRef]
  35. Lima, M.F.; Zarpelão, B.; Sampaio, L.D.; Rodrigues, J.; Abrão, T.; Proença, M.L. Anomaly detection using baseline and K-means clustering. In Proceedings of the SoftCOM 2010, 18th International Conference on Software, Telecommunications and Computer Networks, Split, Croatia, 23–25 September 2010; pp. 305–309. Available online: https://ieeexplore.ieee.org/document/5623690 (accessed on 1 November 2022).
  36. Ripan, R.C.; Sarker, I.H.; Hossain, S.M.M.; Anwar, M.; Nowrozy, R.; Hoque, M.M.; Furhad, H. A Data-Driven Heart Disease Prediction Model Through K-Means Clustering-Based Anomaly Detection. SN Comput. Sci. 2021, 2, 1–12. [Google Scholar] [CrossRef]
  37. Fumanal-Idocin, J.; Rodriguez-Martinez, I.; Indurain, A.; Minárová, M.; Bustince, H. Almost aggregations in the gravitational clustering to perform anomaly detection. Inf. Sci. 2022, 612, 399–413. [Google Scholar] [CrossRef]
  38. Luo, Y.; Xiao, Y.; Cheng, L.; Peng, G.; Yao, D. Deep learning-based anomaly detection in cyber-physical systems: Progress and opportunities. ACM Comput. Surv. (CSUR) 2021, 54, 1–36. [Google Scholar] [CrossRef]
Figure 1. Log data overview (BLOCK*: represents a block or bulk of events/activities/logs).
Figure 2. Point anomaly [6].
Figure 3. Collective anomalies.
Figure 4. Anomaly recognition in data.
Figure 5. Supervised machine learning.
Figure 6. Unsupervised anomaly detection.
Figure 7. Workflow of the proposed method.
Figure 8. Before t-SNE.
Figure 9. Result of t-SNE.
Figure 10. Cluster formation.
Figure 11. Number of clusters.
Figure 12. Possible number of clusters.
Table 1. Summary of existing work.
Ref | Algorithms/Methodology | Dataset(s) | Accuracy | Limitations
[16] | DRAIN log parser | HDFS, HPC, BGL, Linux, Hadoop | 86% | Not for industrial deployment or state identification; not suitable for log messages with variable lengths.
[17] | Log clustering algorithm; hierarchical clustering | Mail, Cron, and Kern | 96% | The pattern selection criteria for a novel form of message are not covered, and not all system framework types are supported.
[18] | K-prototype clustering; K-NN classification; clustering–filtering refinement | SDS1-5 | Precision 99–100%; Recall 87–95% | Due to various types of log data and associated attribute values, accuracy and computational issues arise on new forms of log data.
[19] | Decision tree | Not available | F-measure 93% | Works for a specific type of data; does not support complex patterns and specifications of the system.
[20] | CE (cluster evaluation); string clustering TSA | Lack of access to a bug-tracking system | 85% | As the threshold for abnormalities rises, the precision of the system falls.
[21] | K-means; PCA | HDFS | 97% | Utilizes live stream data rather than batch data.
[22] | Electronic evidence; trap technology; analysis of electronic evidence | No dataset used | No accuracy computed | Data-specific methodology.
[23] | SVM | DARPA | 93% | Not recommended for new unstructured log data.
[24] | Deep CNN | CMRI-Hadoop, Blue Gene/L | Precision 98%; Recall 98%; F1-measure 98% | Not accurate for data from a big sample.
[25] | Recurrent neural network; LSTM network | HDFS, OpenStack | 96% | Serves a single purpose and does not function with different kinds of log files or log data collected from different systems.
[26] | Change detection algorithm | No dataset available | 98% | Failure to adjust to a varied fault injection rate necessitates thorough execution of the necessary steps.
[27] | Apache Hadoop technique; dynamic rule creation | DARPA, Snort | 90% | Converting each field to binary involves much work; therefore, the IP address and port type fields are inoperable.
[28] | Unsupervised algorithm; log clustering; PCA; invariant mining | BGL, HDFS | 92% | Variations in datasets might cause issues with precision and processing efficiency.
[29] | SVM linear; SVM polynomial; SVM RBF | Not available | F-measure 85% | Does not support firewall attacks.
[30] | SD (subgroup discovery) | Not available | Threshold | The subgroup discovery approach may be replaced with other methods that perform more effectively.
[31] | RNN | IE | 88.99% | Other software may provide a better outcome; the modules must be trained using a neural network for various purposes.
Table 2. Dataset details.
Dataset | Dataset Size | Dataset Range | Identified Anomalies | Log Count
BGL | 708 MB | 7 months | 348,460 | 4,747,963
Table 3. Raw dataset features.
@timestamp | IP | Log_ID
26 June 2020, 23:34:29 | 10.1.1.2 | 60Ju8BoTGddM7v_yJz
26 June 2020, 23:34:35 | 10.1.2.3 | 508sYklmsBP0GwVzFkzn
26 June 2020, 23:34:39 | 10.1.1.4 | 95r-8klmsBoTGd7vJiSE
26 June 2020, 23:34:45 | 10.1.2.101 | 18YklmsBP0GwVzPU_1
26 June 2020, 23:34:49 | 10.1.2.42 | 8HsYklmsB7mP0GwVzTVGW
26 June 2020, 23:34:55 | 10.1.1.15 | cO8ksBoTGddM7vZSgD
26 June 2020, 23:34:25 | 10.1.1.109 | vsYjlmsB7mP0GwVz70nV
26 June 2020, 23:34:19 | 10.1.2.25 | 7NMYjlmsB7mP0GwVz2Ehn
Table 4. IP address feature building.
IP | Total Count | Daily Counts | Is_Weekend_Ratio | td_Mean | td_Max
10.1.1.1 | 14489 | 61 | 2.037267 | 23.379098 | 1135.0
10.1.1.100 | 1024 | 116 | 1.745308 | 10.908113 | 1140.0
10.1.1.101 | 502 | 58 | 2.005988 | 22.584830 | 1141.0
10.1.1.109 | 495 | 55 | 1.734807 | 23.070850 | 1151.0
10.1.1.110 | 542 | 67 | 1.852632 | 20.974122 | 1155.0
10.1.1.106 | 505 | 63 | 1.805556 | 22.573413 | 1140.0
10.1.1.15 | 255 | 6 | 1.966102 | 21.666031 | 1200.0
Table 5. Feature building based on timestamp.
Timestamp | IP | Shift Time | Time Diff | Date | Dow | Hour
2020-06-30 13:01:00 | 10.1.1.1 | 2020-06-30 12:30:27 | 30.0 | 2020-06-30 | 6 | 13
2020-06-30 12:30:27 | 10.1.1.1 | 2020-06-30 12:06:49 | 23.0 | 2020-06-30 | 6 | 12
2020-06-30 12:06:49 | 10.1.1.1 | 2020-06-30 12:05:45 | 1.0 | 2020-06-30 | 6 | 12
2020-06-30 12:05:45 | 10.1.1.1 | NaT | NaN | 2020-06-30 | 6 | 12
Table 6. K-means experimental result.
F1 Score | Accuracy | ROC Curve | Specificity | Sensitivity
0.90 | 0.99 | 0.99 | 0.81 | 1.00
Table 7. Key parameters used in the support vector machine.
Parameter | Explanation
Kernel | Specifies whether the method employs a linear, polynomial, radial basis function (RBF), sigmoid, precomputed, or callable kernel as a tuning parameter. If no choice is made, “RBF” is applied.
Degree | The degree of the polynomial kernel function.
Gamma | The kernel coefficient.
Coef0 | An independent term in the kernel function, used by the polynomial and sigmoid kernels.
Nu | An upper bound on the fraction of training errors and a lower bound on the fraction of support vectors; it lies in the interval (0, 1] and is 0.5 by default.
Shrinking | Whether to use the shrinking heuristic.
Cache-size | The size of the kernel cache, in megabytes.
Verbose | Enables verbose output; this relies on a per-process runtime setting in libsvm.
Max-iteration | A hard limit on the number of solver iterations (−1 means no limit).
Table 8. Parameter values used in the support vector machine.
Parameter | Value | Parameter | Value
Cache | 200 MB | Degree | 2
Tol | 0.001 | Verbose | False
Coef0 | 0.0 | Max-iteration | −1
Nu | 0.025 | Shrinking | True
Kernel | rbf | Gamma | auto
Table 9. One-class support vector machine results.
F1 Score | Accuracy | ROC Curve | Specificity | Sensitivity
0.64 | 0.97 | 0.97 | 0.97 | 0.97
Table 10. The key parameters employed in isolation forest.
Parameter Name | Explanation
n-estimators | The number of base estimators in the ensemble.
Max-sampling | The number of samples drawn from X to train each base estimator.
Contamination | The expected proportion of anomalies (contamination) in our dataset.
Max-features | The number of features drawn from X to train each base estimator.
Bootstrap | If true, sampling is performed with replacement; otherwise, sampling is performed without replacement.
n-jobs | The number of fit and predict jobs run in parallel.
Behaviour | Kept for backward compatibility.
Random state | Controls the randomness of the sampling and feature selection.
Verbose | Controls the level of detail reported during the tree-building process.
Warm start | If true, the solution of the previous call is reused and additional estimators are added to the ensemble; otherwise, a new forest is fit.
Table 11. Values of the implementation parameters used.
Parameter | Value | Parameter | Value
Max-sample | auto | Verbose | 0
Contamination | 0.025 | n-jobs | 1
Bootstrap | False | n-estimators | 200
Max-feature | 3 | Random state | 100
Table 12. Proposed isolation forest results.
F1 Score | Accuracy | ROC Curve | Specificity | Sensitivity
0.94 | 0.99 | 0.99 | 1.00 | 0.90
Table 13. Performance evaluation of the selected techniques.
Performance Evaluation | Isolation Forest | K-Means | SVM
F1-score | 0.94 | 0.90 | 0.64
Accuracy | 0.99 | 0.99 | 0.97
ROC curve | 0.99 | 0.99 | 0.98
Specificity | 0.90 | 0.81 | 0.50
Sensitivity | 1.00 | 1.00 | 1.00
Table 14. Comparison with existing algorithms.
Performance Evaluation | Logistic Regression | Decision Tree | SVM | Log Clustering | PCA | Invariant Miner | Proposed Method
Sensitivity | 0.95 | 0.95 | 0.95 | 0.42 | 0.50 | 0.83 | 0.90
Specificity | 0.57 | 0.57 | 0.57 | 0.87 | 0.61 | 0.99 | 1.00
F-measure | 0.94 | 0.71 | 0.72 | 0.57 | 0.55 | 0.91 | 0.94
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
