Article

A Machine Anomalous Sound Detection Method Using the lMS Spectrogram and ES-MobileNetV3 Network

by Mei Wang, Qingshan Mei, Xiyu Song, Xin Liu, Ruixiang Kan, Fangzhi Yao, Junhan Xiong and Hongbing Qiu
1 College of Information Science and Engineering, Guilin University of Technology, Guilin 541006, China
2 Ministry of Education Key Laboratory of Cognitive Radio and Information Processing, Guilin University of Electronic Technology, Guilin 541004, China
* Authors to whom correspondence should be addressed.
Appl. Sci. 2023, 13(23), 12912; https://doi.org/10.3390/app132312912
Submission received: 28 October 2023 / Revised: 19 November 2023 / Accepted: 29 November 2023 / Published: 2 December 2023

Abstract: Unsupervised anomalous sound detection for machines holds significant importance within the realm of industrial automation. Currently, the task of machine anomalous sound detection in complex industrial settings faces issues such as the difficulty of extracting acoustic feature information and an insufficient feature extraction capability within the detection network. To address these challenges, this study proposes a machine anomalous sound detection method using the lMS spectrogram and ES-MobileNetV3 network. Firstly, the log-Mel spectrogram feature and the SincNet spectrogram feature are extracted from the raw wave, and the new lMS spectrogram is formed after fusion, serving as the network input feature. Subsequently, based on the MobileNetV3 network, an improved detection network, ES-MobileNetV3, is proposed. This network incorporates the Efficient Channel Attention module and the SoftPool method, which collectively reduce the loss of feature information and enhance the feature extraction capability of the detection network. Finally, experiments are conducted on the dataset provided by DCASE 2020 Task 2. Our proposed method attains an average area under the receiver operating characteristic curve (AUC) of 96.67% and an average partial AUC (pAUC) of 92.38%, demonstrating superior detection performance compared to other advanced methods.

1. Introduction

With the rapid development of industrial technology and the continuous innovation of automation, modern industrial production equipment is rapidly evolving towards integration, automation, and intelligence. The stable operation of production machines plays a critical role in ensuring production efficiency and product quality, and even in safeguarding the lives and property of employees. Therefore, it is imperative to monitor the working conditions of a machine during operation to ensure the normal production of enterprises.
The anomalous sound detection (ASD) system, which relies on the perception, processing, and decision making of sound signals, is garnering increasing enthusiasm in the field of monitoring, owing to its advantages in safeguarding user privacy [1], real-time capabilities [2], and adaptability to complex environments [3]. Currently, anomalous sound detection technology finds widespread application in various fields, including animal husbandry [4,5], machine condition monitoring [6,7], medical surveillance [8,9], and many other aspects.
In industrial scenarios, ASD is a sound detection task for determining whether the sound emitted by a specific object is normal or anomalous. Detection methods can be broadly categorized as supervised or unsupervised: supervised anomalous sound detection entails predefined anomalies during training, whereas unsupervised methods provide only normal sound samples for model training. In real-world factories, however, anomalous machine sounds are infrequent: sounds produced by machines operating under normal conditions are readily available, while anomalous sounds are rare and highly diverse. Collecting sufficient anomalous data would require deliberately damaging many machines and devices to record their anomalous sounds, which demands very large equipment purchase costs and human resources. Such deliberate destruction is therefore costly, making it challenging to collect a sufficient quantity of machine anomalous sound data for model training [10].
Therefore, the unsupervised anomalous sound detection method, which maximizes the use of normal machine sound data, is better suited to the requirements of machine condition monitoring in real-world industrial scenarios. The general framework of unsupervised ASD is depicted in Figure 1. In unsupervised ASD, only normal sound samples are provided for model training. We first extract acoustic features from the raw sound data as inputs for the model. During training, the model learns the sound information of the input features; the anomaly score is then computed as the negative log-likelihood. During testing, a threshold must be set in advance: when the anomaly score of an input audio sample is less than the threshold, it is judged as normal; otherwise, it is judged as anomalous.
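For illustration, the test-time decision rule described above reduces to a simple comparison (a minimal sketch in Python; the function and variable names are our own, and the threshold is assumed to have been calibrated in advance on normal data):

```python
import numpy as np

def classify(anomaly_scores: np.ndarray, threshold: float) -> np.ndarray:
    """Judge each sample: anomalous when its anomaly score (e.g., a
    negative log-likelihood) reaches or exceeds the preset threshold,
    normal otherwise."""
    return np.where(anomaly_scores < threshold, "normal", "anomalous")
```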
As there is considerable demand for anomalous sound monitoring in industrial application scenarios, several research teams have produced machine anomalous sound datasets; we briefly summarize the most widely used ones. The first is the Case Western Reserve University (CWRU) Bearing Dataset [11], specifically designed for studying the health condition of rolling element bearings in rotating machinery. The dataset covers different types of bearings commonly found in industrial machinery, a diversity that allows researchers to address the challenges associated with various bearing designs and operating conditions. The second common dataset is the mechanical dataset [12], including induction motors, gearboxes, and bearings with 6000, 9000, and 5000 time-series samples, respectively. All three are high-precision machines, so the main goal of this dataset is fault detection for high-precision machinery. In addition, the “Unsupervised Detection of Anomalous Sounds for Machine Condition Monitoring” task was introduced in the Detection and Classification of Acoustic Scenes and Events 2020 challenge (DCASE 2020) [13] and has attracted the participation of numerous scholars worldwide.
In real-world industrial environments, the working conditions of industrial machines are often highly complex. The operational sounds of these machines are frequently intertwined with a myriad of environmental noises. The sound signals produced by machines exhibit intricate characteristics, including nonlinearity and multiple aliasing [14]. Consequently, the extraction of effective acoustic features tailored to these characteristics plays a pivotal role in enhancing the performance of machine anomalous sound detection. A machine anomalous sound detection method that utilizes x-vectors as input features is proposed in [15]. The audio subnet of L3-Net was employed to extract audio embeddings. However, the network forces a distinct behavior of the x-vector, leading to performance degradation. Perez et al. [16] adopt Gammatone filters for extracting acoustic information and generate the Gammatone spectrogram as input features to enhance the representation of machine sound amplitude and phase information. Nonetheless, the performance of Gammatone filters is closely linked to the configuration of calculation parameters, necessitating a substantial number of experiments to select suitable settings.
In order to devise a more stable approach for extracting acoustic features from machines, some researchers have undertaken experimental investigations into the effectiveness of the log-Mel spectrogram feature [17,18]. This feature extraction method is founded on the Mel filter bank, designed to emulate the auditory perception of the human ear. Consequently, the Mel filter bank exhibits a denser filter distribution in the low-frequency range and a sparser distribution in the high-frequency range. This indicates that the log-Mel spectrogram emphasizes the low-frequency information of machine sound, potentially leading to the loss of critical information during the acoustic feature extraction process. This may affect the feature expression of key machine sound information.
In addition to exploring acoustic features, the selection of machine anomalous sound detection models is also a prominent topic within this field. Suefusa et al. [19] employ the traditional interpolation deep neural network (IDNN) to learn the features of normal sounds by minimizing reconstruction errors. However, it may lead to misjudgments when normal and anomalous sounds share common information. Dohi et al. [20] introduce a flow-based machine anomalous sound detection method using normalizing flows for self-supervised density estimation. While this method improves detection performance, it necessitates training a separate detection model for each machine ID, limiting model generalization.
With the rapid advancement of machine learning, many recent studies on machine anomalous sound detection have embraced deep learning methods. Among them, the lightweight MobileNet series [21,22,23] has delivered exceptional results in tasks such as face recognition, image classification, and target detection. MobileNet-based networks are characterized by rapid computation, high flexibility, and light weight. The MobileNetV2 network [22] was first applied to anomalous sound detection in [24], producing the top-ranking model in DCASE Task 2. However, that work focused primarily on improving the network's loss function and did not investigate the feature extraction capability of the detection network in depth. Liu et al. [25] propose an improved MobileNetV3 network to address the low accuracy of current gas leakage detection methods. By introducing an improved attention network, the researchers enhance the network's attention to the crucial areas of the feature images, offering fresh insights into improving the detection accuracy and efficiency of gas leakage detection.
Therefore, there are two main challenges in unsupervised ASD tasks in industrial scenarios. First, machine sound data in industrial production environments contain a large amount of environmental noise, and existing acoustic feature extraction methods cannot highlight the key machine sound signals, resulting in insufficient information expression in the input features. Second, normal machine sound samples can be so similar that it becomes difficult to establish effective distribution boundaries without capturing efficient normal machine acoustic features. The MobileNetV3 network includes a Squeeze-and-Excitation (SE) attention module [26], in which the feature matrix is first reduced and then restored in dimensionality to obtain channel importance weights. The dimensionality reduction operation can result in the loss of some sound signal information, leading to subpar detection results for some machines.
To address these issues, this paper proposes a machine anomalous sound detection method using the lMS spectrogram and ES-MobileNetV3 network. The main contributions of this paper are as follows:
(1)
A novel acoustic feature combination strategy, the lMS spectrogram, is proposed to express machine sound. The lMS spectrogram is obtained by concatenating the log-Mel spectrogram and the SincNet spectrogram. It provides a more comprehensive data description, better capturing the various attributes of machine sound and enhancing the reliability and robustness of feature expression;
(2)
An improved machine anomalous sound detection network, ES-MobileNetV3, is proposed. It amplifies the feature information processing ability of the network by incorporating the ECA attention module with the SoftPool method. This network is anticipated to deliver exceptional anomaly detection performance in noisy industrial environments.
The rest of this study is organized as follows: Section 2 briefly describes related work on anomalous sound research. Section 3 introduces the architecture of the proposed method and explains its technical scheme, principles, and function in detail. Section 4 verifies the rationality of the proposed method with experiments. Finally, Section 5 contains the conclusions and future work.

2. Related Work

The ASD system typically consists of two key aspects: acoustic feature extraction and detection network training. The system begins by extracting relevant acoustic features from the raw input waveform. Subsequently, a neural network is employed to learn the feature distribution of normal sounds, enabling it to differentiate between anomalous and normal sounds during detection. Therefore, this section will primarily delve into the related work of feature extraction and detection models with regard to the ASD task.

2.1. Acoustic Features

With the growing research in fields like speech recognition, sentiment analysis, and anomaly monitoring, there is an increasing emphasis on the precise acquisition and comprehension of sound input features. The rapid advancements in raw wave feature extraction methods have laid a strong foundation for various practical applications based on sound signals. Kuo et al. [27] used the log-Mel spectrogram as an input feature to detect harmonic drive anomalies. The log-Mel spectrogram better simulates the hearing characteristics of the human ear, converting the linear intervals on the frequency axis into nonlinear intervals on the Mel scale, which aligns with human auditory perception while retaining sufficient acoustic information. However, it might filter out high-frequency components of anomalous sounds, where distinctive features may be present. Li et al. [12] propose a multi-stage deep autoencoder network (DAN) that combines the information of three input features, the MFCC, a Gabor filter bank, and a Bark filter bank, to detect abnormal sounds in traffic scenes. It identifies new potential insights through the nonlinear transformation and dimensionality-reduction processes of the DAN, ultimately producing deep audio representations (DARs). The DAR integrates the complementary information of all input features through multi-stage DANs, combining their strengths and enhancing the overall representation. However, with so many input features, computing the potential insights of multiple features requires a tremendous amount of computation. Liu et al. [28] propose a machine anomalous sound detection method based on a spectral–temporal information fusion model. To compensate for the lack of high-frequency information in the log-Mel spectrogram, the authors integrate a temporal feature (Tgram) extracted by the TgramNet network. This method effectively compensates for anomalous information that may be absent in the log-Mel spectrogram. However, compared to the log-Mel spectrogram, detection performance may be influenced by noise present in the temporal information represented by the Tgram features.
In 2018, Ravanelli et al. [29] designed an architecture for extracting acoustic features called SincNet. The SincNet filters used in this architecture are the inverse Fourier transforms of rectangular band-pass filters, inspired by band-pass filters in the field of signal processing. Chang et al. [30] extended this concept by designing a multi-scale SincNet (MS-SincNet) 2D representation extraction network based on the SincNet filter. The MS-SincNet network autonomously learns its filter parameters and exhibits a degree of noise robustness. This underscores the capability of the SincNet spectrogram, generated by the MS-SincNet network, to capture fine machine sound details that the log-Mel spectrogram might overlook. It can distinguish machine sounds from background noise, making it a valuable asset for anomalous sound detection. Therefore, exploring the information representation ability of the SincNet spectrogram in anomalous sound detection is of great significance.

2.2. Detection Network

With the rapid development of deep learning, convolutional neural networks (CNNs) have demonstrated outstanding performance in the domain of anomalous sound detection. CNNs exhibit superior feature extraction capabilities, data understanding, and computational efficiency compared to other neural networks. They offer clear advantages, particularly in real-time tasks such as sound classification and anomaly detection. Pham et al. [31] introduced a CNN architecture tailored for respiratory anomalous sound detection. This architecture leverages the potent feature extraction capabilities of the CNN to discern fine texture details within respiratory acoustic features. This work provides valuable insights for the clinical application of the respiratory anomalous sound detection model. Wang et al. [32] harnessed the MobileNetV3 network to extract refined feature representations from machine sound signals.
They adopt global depthwise convolution (GDConv) instead of global pooling, which significantly improves the performance of the anomalous sound detection system. In recent years, attention-based techniques have gained prominence in the field of anomalous sound detection. The attention mechanism empowers models to concentrate effectively on vital segments of input data, thereby augmenting sound detection accuracy. Mori et al. [33] proposed a method for anomalous sound detection using an attention-based autoencoder. By assigning different weights to each audio frame within the input sound, the sound signal extraction ability and understanding ability of the network are enhanced. Koizumi et al. [34] proposed an attention network called SPIDERnet for a one-shot anomaly detection task. The SPIDERnet attention network focuses on critical aspects of the input sound data, constructing a robust similarity function for time-frequency structure changes by absorbing time-frequency stretching.
Drawing inspiration from the works mentioned above, this paper augments the feature extraction capabilities of the MobileNetV3 network through the utilization of the ECA attention module and the SoftPool method, referred to as the ES attention module. This method captures local channel information via the ES attention module, determining the significance of each feature channel and suppressing irrelevant feature information for the current task.

3. Proposed Method

The overall framework of the proposed method in this paper is depicted in Figure 2. The log-Mel spectrogram and SincNet spectrogram are extracted from the raw wave. Subsequently, these two spectrograms are concatenated into the lMS spectrogram and fed into the ES-MobileNetV3 detection network to calculate the anomaly score [13] for that input feature. Ultimately, the effectiveness of the proposed method is comprehensively evaluated using the anomaly detection performance index. The following section provides an overview of the lMS spectrogram extraction process and delves into the architecture of the ES-MobileNetV3 network.

3.1. Feature Extraction

Acoustic feature extraction is a crucial step in anomalous sound event detection. The extraction of acoustic features with abundant information and robust noise resistance can significantly enhance the performance of anomalous sound event detection. Figure 3 describes in detail the lMS spectrogram extraction process.

3.1.1. log-Mel Spectrogram Extraction

The log-Mel spectrogram is the most commonly used input feature of ASD systems. This spectrogram can capture the global characteristics of a sound, offering a concise representation of the overall frequency distribution in the machine sound signal, and can align with human auditory perception.
Let $x(n)$ be the input original sound signal. Initially, the original sound signal is pre-processed by framing and windowing, and then the fast Fourier transform (FFT) is applied to each frame of data. Since the signal's features are difficult to discern in the time domain, the FFT is used to convert the signal from the time domain to the frequency domain. The calculation process can be written as follows:
$$X(k) = \sum_{n=0}^{N-1} x(n)\exp\left(-j\frac{2\pi kn}{N}\right), \quad 0 \le k \le N-1 \tag{1}$$
where $N$ denotes the number of sample points for the FFT and $X(k)$ denotes the value of the $k$-th frequency point. We then take the squared modulus of the resulting complex spectrogram to obtain the power spectrogram, and the Mel spectrogram is obtained by passing the power spectrogram through the Mel filter bank. The calculation process can be written as follows:
$$S(m) = \ln\left(\sum_{k=0}^{N-1} \left|X(k)\right|^{2} H_m(k)\right), \quad 0 \le m \le M \tag{2}$$
where $H_m(k)$ denotes the transfer function of the $m$-th filter in the Mel filter bank and $M$ denotes the number of filters in the bank. Finally, the log-Mel spectrogram is obtained via the logarithmic energy processing of the Mel spectrogram.
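As a concrete illustration, the log-Mel extraction above can be reproduced with Librosa (which, as noted in Section 3.1.3, implements these operations) using the settings reported in Section 4.3. This is a minimal sketch under our own naming; the file name is a placeholder, and note that librosa.power_to_db applies a decibel (10·log10) scale rather than the natural logarithm written in Equation (2):

```python
import librosa

# Minimal sketch: 16 kHz audio, frame length 1024, hop length 512,
# 128 Mel filters (the settings reported in Section 4.3).
y, sr = librosa.load("machine_sound.wav", sr=16000)  # placeholder file name
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=1024, hop_length=512, n_mels=128, power=2.0
)
log_mel = librosa.power_to_db(mel)  # dB scale, i.e., 10*log10 of the power
# log_mel has shape (128, n_frames); a 10 s clip gives roughly 128 x 313
```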

3.1.2. SincNet Spectrogram Extraction

The low and high cut-off frequencies are the only parameters of the SincNet filter learned from the data. Consequently, the SincNet spectrogram can enhance the frequency-axis representation of machine sound signals. As a result, the SincNet spectrogram complements the log-Mel spectrogram with acoustic information along the frequency axis.
In the field of signal processing, the SincNet filters are rectangular (ideal) band-pass filters. The frequency response of a band-pass filter can be described as the difference between two rectangular low-pass filters. The calculation can be written as follows:
$$G(f, f_1, f_2) = \operatorname{rect}\left(\frac{f}{2f_2}\right) - \operatorname{rect}\left(\frac{f}{2f_1}\right) \tag{3}$$
where $f_1$ and $f_2$ are the low and high cut-off frequencies, and $\operatorname{rect}(\cdot)$ is the rectangular function in the magnitude frequency domain. By applying the inverse Fourier transform to the filter function $G$, we obtain the impulse response of the filter, which is expressed via the sinc function:
$$g(n, f_1, f_2) = 2f_2\operatorname{sinc}(2\pi f_2 n) - 2f_1\operatorname{sinc}(2\pi f_1 n), \quad n = 1, 2, \ldots, L \tag{4}$$
where $L$ is the filter length and the sinc function is defined as $\operatorname{sinc}(x) = \sin(x)/x$. Since the function $g$ is infinitely long in the time domain, a direct calculation would result in the leakage of spectral energy. To mitigate this problem, the Hamming window is used to truncate the function $g$ and to smooth out the abrupt discontinuities at its ends. The calculation can be written as follows:
$$g_w(n, f_1, f_2) = g(n, f_1, f_2) \cdot w(n) \tag{5}$$
where $w(n)$ denotes the Hamming window and $g_w$ is the windowed time-domain impulse response of the SincNet filter. The feature extraction operation of the SincNet filters can be expressed as follows:
$$x_k[n] = x[n] * g_k^{w}(n), \quad k = 1, 2, \ldots, K \tag{6}$$
where $x[n]$ is the input original sound signal, $*$ denotes convolution, $K$ denotes the number of SincNet filters, and $g_k^{w}(n)$ is the $k$-th SincNet filter. Then, 1D batch normalization and the ReLU nonlinear activation function are applied to obtain the filter output. To obtain a SincNet spectrogram with the same dimensions as the log-Mel spectrogram, we apply adaptive average pooling to each filter output, $x_k[n]$. The calculation process can be written as follows:
$$m_k[n] = \operatorname{AdaptiveAvgPool}\left(x_k[n]\right) \tag{7}$$
In this study, the length of the SincNet filter is L = 251 (each filter consists of 251 coefficients).
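For illustration, the windowed sinc band-pass filters of Equations (4) and (5) can be constructed as in the following PyTorch sketch (our own naming; the formulation above indexes $n$ from 1 to $L$, whereas this sketch uses the common symmetric indexing around zero):

```python
import math
import torch

def sinc(x: torch.Tensor) -> torch.Tensor:
    # sinc(x) = sin(x)/x, patching the removable singularity at x = 0
    return torch.where(x == 0, torch.ones_like(x), torch.sin(x) / x)

def sincnet_filter(f1: float, f2: float, length: int = 251) -> torch.Tensor:
    """Band-pass impulse response g_w(n, f1, f2): the difference of two
    sinc low-pass responses (Equation (4)) truncated and smoothed by a
    Hamming window (Equation (5)). f1 and f2 are cut-off frequencies
    normalized by the sampling rate."""
    n = torch.arange(length, dtype=torch.float32) - (length - 1) / 2
    g = 2 * f2 * sinc(2 * math.pi * f2 * n) - 2 * f1 * sinc(2 * math.pi * f1 * n)
    return g * torch.hamming_window(length, periodic=False)
```

Filtering a waveform with a bank of $K$ such filters (Equation (6)) then amounts to a 1D convolution, e.g., torch.nn.functional.conv1d with a kernel of shape (K, 1, 251).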

3.1.3. Spectrogram Fusion

An effective lMS spectrogram representation, $F_{lMS}$, is obtained using a novel fusion strategy that concatenates the log-Mel spectrogram, $F_{lM}$, and the SincNet spectrogram, $F_{S}$. The calculation can be written as follows:
$$F_{lMS} = \operatorname{Concatenate}\left(F_{lM}, F_{S}\right) \tag{8}$$
Figure 4a,b are the log-Mel spectrogram and SincNet spectrogram of the machine Fan running sound, respectively. We can see from the figure that the log-Mel spectrogram can more effectively capture the global information of the machine sound signal. The SincNet spectrogram has a higher degree of energy concentration, exhibits noticeable harmonic-related and percussive-related features, and retains more critical information about the machine sound signal.
The above operations are implemented using Librosa.
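The fusion step itself reduces to a single concatenation. The axis along which the two 128 × 313 spectrograms are joined is not stated explicitly above; the following sketch assumes the frequency axis, yielding a 256 × 313 lMS feature:

```python
import torch

log_mel = torch.randn(1, 128, 313)    # placeholder log-Mel spectrogram
sinc_spec = torch.randn(1, 128, 313)  # placeholder SincNet spectrogram
lms = torch.cat([log_mel, sinc_spec], dim=1)  # assumed frequency-axis fusion
print(lms.shape)  # torch.Size([1, 256, 313])
```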

3.2. Anomaly Detection Network

In this study, we employ MobileNetV3 [23] as the primary architecture of the sound detection network. MobileNetV3 builds upon the foundation of its predecessors, incorporating the depthwise separable convolution of MobileNetV1 and integrating the linear bottleneck and inverted residual structures of MobileNetV2. This design effectively reduces computational demands while maintaining a compact structure. Furthermore, MobileNetV3 introduces the SE attention module within the inverted residual structure. The SE attention module dynamically adjusts the importance of different channels in the feature map by learning a set of channel weights, facilitating feature selection and enhancement. However, to minimize computational overhead, the SE attention module performs a dimensionality reduction on the feature matrix at the fully connected layer. This operation discards some of the original information, so the network cannot accurately restore it; as a result, it may degrade network performance while the added fully connected layers increase the complexity of the model and the amount of calculation.
In light of these considerations, this paper proposes an improved ES-MobileNetV3 network model for machine anomalous sound detection. The primary structure of the model is depicted in Figure 5. We have made modifications to the SE attention module of the MobileNetV3 network, replacing it with the Efficient Channel Attention (ECA) attention module [35], and implementing SoftPool [36] to replace the global average pooling of the ECA attention module. The ECA attention module presents a novel strategy for local cross-channel interaction, eliminating dimensionality reduction operations to obtain channel weights through adaptive one-dimensional convolution. The authors of [35] demonstrated the significance of avoiding dimensionality reduction in learning channel weights. Effective cross-channel interaction can reduce model complexity while preserving strong performance. By adaptively selecting the size of the one-dimensional convolutional kernel, we can define the scope of local cross-channel interaction, resulting in both reduced computational demands and performance optimization.
In the ECA attention module, a global average pooling method is used to reduce the size of the feature map. Common pooling methods typically involve maximum pooling and average pooling, both of which will lead to the loss of some useful information. The average pooling performs average operations on the feature points in the feature region, which reduces the amount of computation but weakens the feature representation of the sound data. To address this issue, our approach aims to maintain more acoustic feature information during the down-sampling process, thus enhancing detection accuracy. We use the SoftPool operation instead of the average pooling operation in the ECA attention module. SoftPool continually updates gradient values during backpropagation, minimizing information loss in the pooling process while preserving the functionality of the pooling layer.
The calculation process of SoftPool is shown in Figure 6. The feature region is defined as $P$, and the size of the pooling kernel region $R$ is $k \times k$. The SoftPool operation calculates the corresponding feature weight according to the feature values. The calculation process can be described as follows:
$$w_i = \frac{e^{a_i}}{\sum_{j \in R} e^{a_j}} \tag{9}$$
where $w_i$ denotes the weight and $a_i$ denotes the value of a feature point; the weights ensure that feature texture information can be transmitted more effectively. The output value of the SoftPool operation is produced through a standard summation of all weighted activations within the pooling kernel neighborhood, $R$. The calculation process can be described as follows:
$$\tilde{a} = \sum_{i \in R} w_i a_i \tag{10}$$
where $\tilde{a}$ denotes the output value of the feature point after the SoftPool operation. SoftPool has the advantages of both average pooling and maximum pooling, reducing the risk of losing important feature information during pooling and enhancing the expression of prominent features.
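A compact way to realize Equations (9) and (10) for 2D feature maps is sketched below. This is our own illustrative implementation rather than the official SoftPool code [36]; it exploits the fact that the ratio of two average-pooled quantities equals the ratio of the corresponding regional sums:

```python
import torch
import torch.nn.functional as F

def soft_pool2d(x: torch.Tensor, kernel_size: int = 2) -> torch.Tensor:
    """SoftPool: each output is the sum of activations in the pooling
    region weighted by w_i = exp(a_i) / sum_j exp(a_j). For numerical
    stability, a production version would subtract the maximum
    activation before exponentiating."""
    w = torch.exp(x)
    # avg_pool2d divides numerator and denominator by the same k*k factor,
    # so the ratio equals sum(w * x) / sum(w) over each pooling region
    return F.avg_pool2d(w * x, kernel_size) / F.avg_pool2d(w, kernel_size)
```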
The ES attention module is the ECA attention module using the SoftPool method; its structure is shown in Figure 7. Firstly, the features are aggregated via the SoftPool operation to obtain per-channel information; then, the kernel size $k$, which sets the number of interacting channels, is adaptively calculated from the channel dimension $C$. The adaptive function can be written as follows:
$$k = \psi(C) = \left|\frac{\log_2(C)}{\gamma} + \frac{b}{\gamma}\right|_{odd} \tag{11}$$
where $\gamma = 2$ and $b = 1$, and $|t|_{odd}$ indicates the nearest odd number to $t$. Finally, the weight of each channel is obtained using the Sigmoid function, and the final output feature map is obtained by multiplying the channel weights with the corresponding elements of the input feature.
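Putting the pieces together, the ES attention module can be rendered as the following PyTorch sketch (our own naming and layer choices: a global SoftPool produces one descriptor per channel, a 1D convolution of adaptive kernel size $k$ models local cross-channel interaction as in ECA, and a Sigmoid yields the channel weights; an even value of $k$ is rounded up to the next odd number):

```python
import math
import torch
import torch.nn as nn

class ESAttention(nn.Module):
    def __init__(self, channels: int, gamma: int = 2, b: int = 1):
        super().__init__()
        t = int(abs(math.log2(channels) / gamma + b / gamma))
        k = t if t % 2 == 1 else t + 1  # nearest odd number, Equation (11)
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, _, _ = x.shape
        flat = x.view(n, c, -1)
        # global SoftPool: softmax-weighted sum over all spatial positions
        weights = torch.softmax(flat, dim=-1)
        y = (weights * flat).sum(dim=-1)          # (n, c) channel descriptors
        # local cross-channel interaction without dimensionality reduction
        y = self.conv(y.unsqueeze(1)).squeeze(1)  # (n, c)
        w = torch.sigmoid(y).view(n, c, 1, 1)
        return x * w                              # reweight the input channels
```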

4. Experiment and Analysis

4.1. Dataset

We evaluate our method using the DCASE 2020 Challenge Task 2 dataset [13]. The dataset consists of the ToyADMOS dataset [37] and the MIMII dataset [38], including normal and anomalous sounds of six machine types: ToyCar, ToyConveyor, Valve, Pump, Fan, and Slider. Except for ToyConveyor, which includes six machine IDs, each type contains seven machine IDs; the machine ID is used to identify different machines of the same machine type. The collected sound signals include the running sound of the target mechanical equipment and the background noise of a real factory. Each sound sample is around 10 s long and sampled at 16 kHz. The training set only contains the running sound of the machine in a normal state, and the test set consists of normal sounds and various anomalous-state running sounds.

4.2. Evaluation Metrics

The performance is evaluated with the area under the receiver operating characteristic (ROC) curve (AUC) and the partial AUC (pAUC) [13]. The AUC is a common anomaly detection performance index; the pAUC is calculated as the AUC over a low false-positive-rate (FPR) range [0, p]. The calculation formulas of the AUC and pAUC are expressed as follows:
$$\mathrm{AUC} = \frac{1}{N_{-}N_{+}} \sum_{i=1}^{N_{-}} \sum_{j=1}^{N_{+}} H\left(S(x_j^{+}) - S(x_i^{-})\right) \tag{12}$$
$$\mathrm{pAUC} = \frac{1}{\lfloor pN_{-} \rfloor N_{+}} \sum_{i=1}^{\lfloor pN_{-} \rfloor} \sum_{j=1}^{N_{+}} H\left(S(x_j^{+}) - S(x_i^{-})\right) \tag{13}$$
where $H(a)$ is the hard-threshold function that returns 1 when $a > 0$ and 0 otherwise; $x_i^{-}$ and $x_j^{+}$ are normal and anomalous test samples, respectively; $N_{-}$ and $N_{+}$ are the numbers of normal and anomalous test samples, respectively; $\lfloor \cdot \rfloor$ is the flooring function; and $p = 0.1$ in this study.
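For reference, Equations (12) and (13) can be computed directly from two arrays of anomaly scores (our own sketch; it reads the pAUC sum as running over the $\lfloor pN_{-} \rfloor$ highest-scoring normal samples, i.e., those that define the low-FPR range):

```python
import numpy as np

def auc_and_pauc(s_normal: np.ndarray, s_anom: np.ndarray, p: float = 0.1):
    """AUC and pAUC from anomaly scores of normal and anomalous samples."""
    n_minus, n_plus = len(s_normal), len(s_anom)
    # H(S(x+) - S(x-)) evaluated for every (normal, anomalous) pair
    hits = s_anom[None, :] > s_normal[:, None]
    auc = hits.sum() / (n_minus * n_plus)
    # low-FPR range: the floor(p * N-) highest-scoring normal samples
    m = int(np.floor(p * n_minus))
    top = np.sort(s_normal)[::-1][:m]
    pauc = (s_anom[None, :] > top[:, None]).sum() / (m * n_plus)
    return auc, pauc
```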

4.3. Experimental Setup

All experiments are conducted using PyTorch 1.13.0 and Python 3.7.13, with PyCharm as the development platform. The experiments are carried out on a system featuring an Intel Core i7-12700H CPU and an NVIDIA RTX 3070 (8 GB) GPU.
During the extraction of the log-Mel spectrogram, the frame length is set to 1024, the overlap length to 512, and the number of Mel filters to 128. For the extraction of the SincNet spectrogram, 128 SincNet filters are employed, and the adaptive average pooling output size is set to 313. Consequently, both the log-Mel spectrogram and the SincNet spectrogram have a dimension of 128 × 313. The Adam optimizer [39] is utilized for model training with a learning rate of 0.00001, and the ArcFace loss function [40] is applied. The model is trained over 200 epochs with a batch size of 64.

4.4. Comparison of Different Input Features

To investigate the impact of different input features on detection performance and validate the effectiveness of the lMS spectrogram, we conduct ablation experiments using the log-Mel spectrogram, SincNet spectrogram, and lMS spectrogram as input features for the MobileNetV3 network. The experimental results are shown in Figure 8. Analyzing the results, we observe that when the log-Mel spectrogram is employed as the input feature, the AUC and pAUC in the machine types Slider, ToyCar, and ToyConveyor are higher than those with the SincNet spectrogram as the input feature. However, the AUC and pAUC for the machine types Valve, Pump, and Fan are higher when using the SincNet spectrogram as the input feature. This suggests that the log-Mel spectrogram and the SincNet spectrogram are complementary in the feature extraction of machine acoustic signals. Notably, when the lMS spectrogram is employed as the input feature, there are significant improvements in AUC and pAUC for all machine types, particularly for Valve and Fan, where AUC and pAUC improve by 6.99% and 9.58%, respectively, compared to using the log-Mel spectrogram as the input feature alone. However, the improvement for ToyConveyor is less pronounced. This may be due to the high similarity of normal audio data in ToyConveyor and the introduction of highly repetitive information due to multi-feature input, leading to overfitting during training.
To vividly illustrate the complementarity of the two acoustic feature signal extraction methods, we use the example of Fan and provide t-distribution stochastic neighbor embedding (t-SNE) cluster visualizations of the latent features extracted from the log-Mel spectrogram, SincNet spectrogram, and lMS spectrogram in Figure 9. In Figure 9a,b, when using the log-Mel spectrogram as the input feature, the normal and anomalous samples of ID00 overlap, while the normal and anomalous samples of ID02 are well distinguished. In contrast, in Figure 9c,d, where the SincNet spectrogram is the input feature, the normal and anomalous samples of ID00 can be well distinguished, but the normal and anomalous samples of ID02 overlap. However, in Figure 9e,f, both the normal and anomalous samples of ID00 and ID02 can be effectively differentiated. This demonstrates that the SincNet spectrogram complements the feature information filtered by the log-Mel spectrogram, while the lMS spectrogram provides greater discrimination capabilities.

4.5. Comparison of Different Detection Models

In order to further investigate the advantages of the ES attention module proposed in this study, we utilize the lMS spectrogram as the input feature and employ the MobileNetV3, ECA-MobileNetV3, and ES-MobileNetV3 networks as the machine anomalous sound detection networks in an ablation study. The experimental results are presented in Figure 10, from which it is evident that the ES-MobileNetV3 network proposed in this paper achieves the best results across the six machine types. The AUC and pAUC reach 96.67% and 92.38%, respectively, which are 4.53% and 4.85% higher than the original MobileNetV3 network. Notably, for Pump and ToyConveyor, the AUC increases by 6.34% and 6.92%, respectively, and the pAUC increases by 6.64% and 8.26%, respectively. Despite the similarity of normal audio data from the ToyConveyor machine, the SoftPool layer in the detection network retains the fine features of the machine sound and enhances the recognition of the feature information. This demonstrates that the ES attention module proposed in this paper can capture sample features to the fullest extent and detect the key machine sound information. Comparing the results of the ES-MobileNetV3 network and the ECA-MobileNetV3 network, the AUC and pAUC for most machines are improved, although the detection performance for the Valve machine decreases slightly. Reference [38] suggests that the sound signal of the Valve machine is non-stationary, so we speculate that the SoftPool method used in the ES-MobileNetV3 network may be less effective in handling non-stationary sound signals. These experiments confirm that the ES-MobileNetV3 network effectively optimizes the performance of the anomalous sound detection system: the ECA module avoids the dimensionality reduction operation during feature information processing, reducing information loss, while the SoftPool method preserves more feature information during pooling, addressing the insufficient feature extraction ability of the lightweight network.
Furthermore, we conduct a comparative analysis to assess the detection accuracy, model parameters, and FLOPs (floating point operations) both before and after enhancing the network, using the machine anomalous sound dataset. Moreover, we also compare the results to those obtained using the MobileNetV2 network. As the predecessor of the MobileNetV3 network, the MobileNetV2 network is a well-established lightweight network and has been widely used in various mobile devices. The comparison results are presented in Table 1.

4.6. Comparison of Different Detection Methods

To further substantiate the merits of the proposed method, we conduct a comparative analysis, pitting the machine anomaly detection performance of our method against other mainstream approaches. The comparative results are detailed in Table 2.
As illustrated in Table 2, Ref. [13] represents the official baseline model provided by DCASE 2020 Task 2, while [15,17,28] are detection methods using x-vectors, the log-Mel spectrogram, and spectral–temporal information fusion as input features, respectively. Additionally, other researchers have proposed machine anomalous sound detection networks [19,20,24,32]. Table 2 reveals that our proposed method achieves the best detection results in terms of both the average AUC and the average pAUC. Furthermore, our method outperforms the other approaches for the specific machine types of Pump, ToyCar, and ToyConveyor. However, it should be noted that even the method with the highest average AUC and average pAUC fails to attain the best detection performance across all machine types. Notably, the ToyConveyor machine presents a challenge in finding an effective distribution boundary due to the high similarity of its normal acoustic features, resulting in detection accuracy far lower than that of the other five machine types. Therefore, how to better compute the distribution boundary of the ToyConveyor machine is the key to further improving the comprehensive performance of the system.

5. Conclusions

This paper proposes a machine anomalous sound detection method using the lMS spectrogram and ES-MobileNetV3 network. The proposed method extracts both the log-Mel spectrogram and the SincNet spectrogram from the raw wave and concatenates them to form a more discriminative machine sound feature, the lMS spectrogram. This feature is capable of representing a broader range of machine sound information and enhancing the distinctive characteristics of machine sound data. Additionally, an ES attention module is proposed to optimize the linear bottleneck structure of the MobileNetV3 network, which avoids the information loss caused by the dimensionality reduction operation and retains richer feature information. The effectiveness of this method is validated through ablation experiments, demonstrating remarkable results on the DCASE 2020 Task 2 dataset.
However, it should be acknowledged that this approach may not yield the best results for every machine type. In future work, our goal is to explore multi-modal information fusion methods that are well-suited for a broader range of machines, ultimately enhancing the generalization and robustness of the machine anomalous sound detection system.

Author Contributions

Conceptualization, M.W., X.S., and Q.M.; software, Q.M., M.W., and X.L.; experiment analysis, Q.M., F.Y., and J.X.; writing—review and editing, Q.M., X.S., and R.K.; funding acquisition, M.W., and H.Q. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the National Natural Science Foundation of China (61961010 and 62071135); Projects from the Key Laboratory of Cognitive Radio and Information Processing, the Ministry of Education, and Guilin University of Electronic Technology (CRKL220105, CRKL200111, and CRKL220204); and the ‘Ba Gui Scholars’ program of the provincial government of Guangxi.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. The data can be found at https://dcase.community/challenge2020/ (accessed on 20 October 2023).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Golda, T.; Guaia, D.; Wagner-Hartl, V. Perception of Risks and Usefulness of Smart Video Surveillance Systems. Appl. Sci. 2022, 12, 10435. [Google Scholar] [CrossRef]
  2. Marques, G.; Pitarma, R. A Real-Time Noise Monitoring System Based on Internet of Things for Enhanced Acoustic Comfort and Occupational Health. IEEE Access 2020, 8, 139741–139755. [Google Scholar] [CrossRef]
  3. Liu, Z.; Li, S. A sound monitoring system for prevention of underground pipeline damage caused by construction. Autom. Constr. 2020, 113, 103125. [Google Scholar] [CrossRef]
  4. Chung, Y.; Oh, S.; Lee, J.; Park, D.; Chang, H.-H.; Kim, S. Automatic Detection and Recognition of Pig Wasting Diseases Using Sound Data in Audio Surveillance Systems. Sensors 2013, 13, 12929–12942. [Google Scholar] [CrossRef] [PubMed]
  5. Du, X.; Lao, F.; Teng, G. A Sound Source Localisation Analytical Method for Monitoring the Abnormal Night Vocalisations of Poultry. Sensors 2018, 18, 2906. [Google Scholar] [CrossRef] [PubMed]
  6. Nasir, V.; Cool, J.; Sassani, F. Intelligent Machining Monitoring Using Sound Signal Processed with the Wavelet Method and a Self-Organizing Neural Network. IEEE Robot. Autom. Lett. 2019, 4, 3449–3456. [Google Scholar] [CrossRef]
  7. Lu, Z.; Wang, M.; Dai, W.; Sun, J. In-process complex machining condition monitoring based on deep forest and process information fusion. Int. J. Adv. Manuf. Technol. 2019, 104, 1953–1966. [Google Scholar] [CrossRef]
  8. Khoruamkid, S.; Visitsattapongse, S. A Low-Cost Digital Stethoscope for Normal and Abnormal Heart Sound Classification. In Proceedings of the 14th Biomedical Engineering International Conference (BMEiCON), Songkhla, Thailand, 10–13 November 2022; pp. 1–6. [Google Scholar]
  9. Bailoor, S.; Seo, J.H.; Schena, S.; Mittal, R. Detecting Aortic Valve Anomaly from Induced Murmurs: Insights from Computational Hemodynamic Models. Front. Physiol. 2021, 12, 734224. [Google Scholar] [CrossRef] [PubMed]
  10. Nassif, A.B.; Talib, M.A.; Nasir, Q.; Dakalbab, F.M. Machine learning for anomaly detection: A systematic review. IEEE Access 2021, 9, 78658–78700. [Google Scholar] [CrossRef]
  11. Smith, W.A.; Randall, R.B. Rolling element bearing diagnostics using the case western reserve university data: A benchmark study. Mech. Syst. Signal Process. 2015, 64–65, 100–131. [Google Scholar] [CrossRef]
  12. Li, Y.; Li, X.; Zhang, Y.; Liu, M.; Wang, W. Anomalous Sound Detection Using Deep Audio Representation and a BLSTM Network for Audio Surveillance of Roads. IEEE Access 2018, 6, 58043–58055. [Google Scholar] [CrossRef]
  13. Koizumi, Y.; Kawaguchi, Y.; Imoto, K.; Nakamura, T.; Nikaido, Y.; Tanabe, R.; Purohit, H.; Suefusa, K.; Endo, T.; Yasuda, M.; et al. Description and discussion on DCASE2020 challenge task2: Unsupervised anomalous sound detection for machine condition monitoring. arXiv 2020, arXiv:2006.05822. [Google Scholar]
  14. Koizumi, Y.; Saito, S.; Uematsu, H.; Harada, N. Optimizing Acoustic Feature Extractor for Anomalous Sound Detection Based on Neyman-Pearson Lemma. In Proceedings of the 2017 25th European Signal Processing Conference (EUSIPCO), Kos, Greece, 28 August–2 September 2017. [Google Scholar]
  15. Wilkinghoff, H. Anomalous Sound Detection with Look, Listen, and Learn Embeddings. In Tech. Report in DCASE2020 Challenge Task 2; Detection and Classification of Acoustic Scenes and Events: Munich, Germany, 2020. [Google Scholar]
  16. Perez-Castanos, S.; Naranjo-Alcazar, J.; Zuccarello, P.; Cobos, M. Anomalous sound detection using unsupervised and semi-supervised autoencoders and gammatone audio representation. arXiv 2020, arXiv:2006.15321. [Google Scholar]
  17. Van Truong, H.; Hieu, N.C.; Giao, P.N.; Phong, N.X. Unsupervised Detection of Anomalous Sound for Machine Condition Monitoring using Fully Connected U-Net. J. ICT Res. Appl. 2021, 15, 41–45. [Google Scholar] [CrossRef]
  18. Cui, P.; Luo, X.; Li, X.; Luo, X. Research on the enhancement of machine fault evaluation model based on data-driven. Int. J. Metrol. Qual. Eng. 2022, 13, 13. [Google Scholar] [CrossRef]
  19. Suefusa, K.; Nishida, T.; Purohit, H.; Tanabe, R.; Endo, T.; Kawaguchi, Y. Anomalous Sound Detection Based on Interpolation Deep Neural Network. In Proceedings of the ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–9 May 2020; pp. 271–275. [Google Scholar]
  20. Dohi, K.; Endo, T.; Purohit, H.; Tanabe, R.; Kawaguchi, Y. Flow-Based Self-Supervised Density Estimation for Anomalous Sound Detection. In Proceedings of the ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 336–340. [Google Scholar]
  21. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  22. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520. [Google Scholar]
  23. Howard, A.; Sandler, M.; Chu, G. Searching for MobileNetV3. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–3 November 2019; pp. 1314–1324. [Google Scholar]
  24. Giri, R.; Tenneti, S.V.; Cheng, F.; Helwani, K.; Isik, U.; Krishnaswamy, A. Unsupervised Anomalous Sound Detection Using Self-Supervised Classification and Group Masked Autoencoder for Density Estimation. In Tech. Report in DCASE2020 Challenge Task 2; Detection and Classification of Acoustic Scenes and Events: Palo Alto, CA, USA, 2020. [Google Scholar]
  25. Liu, P.; Xu, Y.; Wang, Y.; Yu, Y. Gas Leak Fault Detection Based on Improved MobileNetV3. In Proceedings of the 7th International Conference on Transportation Information and Safety (ICTIS), Xi’an, China, 4–6 August 2023; pp. 2181–2189. [Google Scholar]
  26. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  27. Kuo, J.-Y.; Hsu, C.-Y.; Wang, P.-F.; Lin, H.-C.; Nie, Z.-G. Constructing Condition Monitoring Model of Harmonic Drive. Appl. Sci. 2022, 12, 9415. [Google Scholar] [CrossRef]
  28. Liu, Y.; Guan, J.; Zhu, Q.; Wang, W. Anomalous Sound Detection Using Spectral-Temporal Information Fusion. In Proceedings of the ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 816–820. [Google Scholar]
  29. Ravanelli, M.; Bengio, Y. Speaker Recognition from Raw Waveform with SincNet. In Proceedings of the 2018 IEEE Spoken Language Technology Workshop (SLT), Athens, Greece, 18–21 December 2018; pp. 1021–1028. [Google Scholar]
  30. Chang, P.C.; Chen, Y.S.; Lee, C.H. MS-SincResNet: Joint Learning of 1D and 2D Kernels Using Multi-scale SincNet and ResNet for Music Genre Classification. In Proceedings of the 2021 International Conference on Multimedia Retrieval, Taipei, Taiwan, 21–24 August 2021; pp. 29–36. [Google Scholar]
  31. Pham, L.; Phan, H.; Palaniappan, R. CNN-MoE Based Framework for Classification of Respiratory Anomalies and Lung Disease Detection. IEEE J. Biomed. Health Inform. 2021, 25, 2938–2947. [Google Scholar] [CrossRef] [PubMed]
  32. Wang, Y.; Zheng, Y.; Zhang, Y.; Xie, Y.; Xu, S.; Hu, Y.; He, L. Unsupervised Anomalous Sound Detection for Machine Condition Monitoring Using Classification-Based Methods. Appl. Sci. 2021, 11, 11128. [Google Scholar] [CrossRef]
  33. Mori, H.; Tamura, S.; Hayamizu, S. Anomalous Sound Detection Based on Attention Mechanism. In Proceedings of the 2021 29th European Signal Processing Conference (EUSIPCO), Dublin, Ireland, 23–27 August 2021; pp. 581–585. [Google Scholar]
  34. Koizumi, Y.; Yasuda, M.; Murata, S.; Saito, S.; Uematsu, H.; Harada, N. SPIDERnet: Attention Network for One-Shot Anomaly Detection in Sounds. In Proceedings of the ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 281–285. [Google Scholar]
  35. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11531–11539. [Google Scholar]
  36. Stergiou, A.; Poppe, R.; Kalliatakis, G. Refining activation downsampling with SoftPool. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 10337–10346. [Google Scholar]
  37. Koizumi, Y.; Saito, S.; Uematsu, H.; Harada, N.; Imoto, K. ToyADMOS: A dataset of miniature-machine operating sounds for anomalous sound detection. In Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA, 20–23 October 2019; pp. 308–312. [Google Scholar]
  38. Purohit, H.; Tanabe, R.; Ichige, T.; Endo, T.; Nikaido, Y.; Suefusa, K.; Kawaguchi, Y. MIMII Dataset: Sound dataset for malfunctioning industrial machine investigation and inspection. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019), Tokyo, Japan, 20 September 2019; pp. 209–213. [Google Scholar]
  39. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  40. Deng, J.; Guo, J.; Xue, N.; Zafeiriou, S. Arcface: Additive Angular Margin Loss for Deep Face Recognition. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 4685–4694. [Google Scholar]
Figure 1. Unsupervised machine anomalous sound detection system framework. Φ represents the threshold; when the anomaly score exceeds the Φ, the input is identified as anomalous.
Figure 2. The framework of the proposed method for anomalous sound detection.
Figure 3. The extraction process of lMS spectrogram.
Figure 4. The lMS spectrogram of the machine Fan running sound. (a) log-Mel spectrogram; (b) SincNet spectrogram.
Figure 5. The proposed ES-MobileNetV3 network architecture for machine anomalous sound detection.
Figure 6. The calculation process of SoftPool.
Figure 7. Structural diagram of the ES attention module.
Figure 8. Performance comparison for different input features. (a) AUC. (b) pAUC.
Figure 9. The t-SNE visualization of latent embeddings for test data of the machine type Fan: (a) log-Mel spectrogram (ID00); (b) log-Mel spectrogram (ID02); (c) SincNet spectrogram (ID00); (d) SincNet spectrogram (ID02); (e) lMS spectrogram (ID00); (f) lMS spectrogram (ID02). The “•” and “×” denote normal and anomalous classes, respectively.
Figure 10. Performance comparison for different detection networks. (a) AUC. (b) pAUC.
Table 1. Comparison of the evaluation indicators for the different networks.

Model Name     | Parameters (M) | FLOPs (M) | Accuracy (%)
MobileNetV2    | 3.50           | 530.10    | 88.67
MobileNetV3    | 1.80           | 417.36    | 92.14
ES-MobileNetV3 | 1.53           | 421.37    | 96.67
Table 2. Performance comparison of different detection methods, reported as AUC (pAUC) in %. The bold font indicates the best performance.

Method              | Slider        | Valve         | Pump          | Fan           | ToyCar        | ToyConveyor   | Average
Baseline [13]       | 84.76 (66.53) | 66.28 (50.98) | 72.89 (59.99) | 65.83 (52.45) | 78.77 (67.58) | 72.53 (60.43) | 73.51 (59.66)
x-vector [15]       | 95.71 (79.45) | 94.87 (83.58) | 93.19 (81.10) | 97.35 (80.68) | 94.06 (86.80) | 84.22 (69.12) | 92.63 (80.12)
log-Mel [17]        | 90.13 (73.97) | 84.87 (61.38) | 85.97 (71.10) | 80.06 (58.61) | 82.52 (66.34) | 76.75 (55.65) | 83.38 (64.51)
ST-gram [28]        | 99.55 (97.61) | 99.64 (98.44) | 91.94 (81.75) | 94.04 (88.97) | 94.44 (87.68) | 74.57 (63.60) | 92.36 (86.34)
IDNN [19]           | 86.45 (67.58) | 84.09 (64.94) | 73.76 (61.07) | 67.71 (52.90) | 78.69 (69.22) | 71.07 (59.70) | 76.96 (62.57)
MobileNetV2 [24]    | 95.27 (85.22) | 88.65 (87.98) | 82.53 (76.50) | 80.19 (74.40) | 87.66 (85.92) | 69.71 (56.43) | 84.34 (77.74)
Flow [20]           | 94.60 (82.80) | 91.40 (75.00) | 83.40 (73.80) | 74.90 (65.30) | 92.20 (84.10) | 71.50 (59.00) | 85.20 (73.90)
Classification [32] | 99.97 (99.83) | 95.82 (93.58) | 97.35 (91.58) | 99.96 (99.84) | 92.02 (88.50) | 89.80 (80.61) | 95.82 (92.32)
Our Method          | 99.42 (98.67) | 98.13 (96.28) | 97.64 (93.36) | 97.78 (94.31) | 95.54 (89.54) | 91.49 (82.13) | 96.67 (92.38)

