Article

Multi-Sensor Data Fusion Method Based on Self-Attention Mechanism

Xuezhu Lin, Shihan Chao, Dongming Yan, Lili Guo, Yue Liu and Lijuan Li
1 Key Laboratory of Optoelectronic Measurement and Control and Optical Information Transmission Technology of the Ministry of Education, School of Optoelectronic Engineering, Changchun University of Science and Technology, Changchun 130022, China
2 Zhongshan Research Institute, Changchun University of Science and Technology, Zhongshan 528400, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(21), 11992; https://doi.org/10.3390/app132111992
Submission received: 18 September 2023 / Revised: 18 October 2023 / Accepted: 26 October 2023 / Published: 3 November 2023
(This article belongs to the Topic Complex Systems and Artificial Intelligence)

Featured Application

The proposed multi-sensor data fusion method utilizes the self-attention mechanism in CNN-SA networks to enhance data integrity and accuracy, and it has potential applications in various fields where accurate and reliable multi-sensor data fusion is required.

Abstract

In 3D reconstruction tasks, single-sensor data fusion based on deep learning is limited by the integrity and accuracy of the data, which reduces the accuracy and reliability of the fusion results. To address this issue, this study proposes a multi-sensor data fusion method based on a self-attention mechanism. A multi-sensor data fusion model for acquiring multi-source and multi-modal data is constructed, with the core component being a convolutional neural network with self-attention (CNN-SA), which employs CNNs to process multi-source and multi-modal data by extracting their features. Additionally, it introduces an SA mechanism to weigh and sum the features of different modalities, adaptively focusing on the importance of different modal data. This enables mutual support, complementarity, and correction among the multi-modal data. Experimental results demonstrate that the accuracy of the CNN-SA network is improved by 72.6%, surpassing the improvements of 29.9% for CNN-CBAM, 23.6% for CNN, and 11.4% for CNN-LSTM, exhibiting enhanced generalization capability, accuracy, and robustness. The proposed approach will contribute to the effectiveness of multi-sensor data fusion processing.

1. Introduction

3D reconstruction has a wide range of applications, including digital twins, virtual reality, and engineering design. Its primary objective is to restore the 3D structure of a target object or scene. The continuous evolution of sensor technologies has made it possible to obtain high-quality and accurate point cloud and image data [1]. However, due to the rapid development of digital information and the increasing complexity of neural networks, relying solely on a single sensor is insufficient for capturing the rich features of target objects. Laser scanning techniques are widely employed in 3D reconstruction due to their ability to provide highly accurate 3D scanning data with sufficient geometric detail. However, 3D-scanned point clouds often lack color and texture information, making it challenging to capture fine details, especially in cases involving occlusion. Additionally, acquiring precise data necessitates careful planning of the scanner’s measurement positions. Laser tracking systems provide high-precision 3D measurements of target reflectors with six-degree-of-freedom (6DoF) capabilities, yielding accurate information about the position of the target object. Nevertheless, data collected with laser tracking systems are sparse and lack detailed surface characterization. RGB-D images acquired using cameras and depth sensors have gained widespread use in 3D vision applications. Compared with 3D point cloud data, RGB images contain rich geometric information, and combining 3D point clouds with color information allows for the extraction of comprehensive geometric and textural features [2]. However, RGB-D sensors are limited by their scanning range and high noise levels [3]. To overcome these limitations, multiple sensors have been employed to measure different regions and angles of the same target, resulting in the acquisition of multi-source and multi-modal data. The fusion of such data enables more accurate reconstruction of target objects.
Recently, multi-modal data fusion has become a research hotspot in the field of computer vision. Its objective is to integrate various data from different sources, types, and forms to enhance data accuracy and precision, thereby facilitating more effective data utilization. Data fusion based on multi-source and multi-modal techniques can acquire the overall characteristics of target objects [4]. The widespread deployment of multi-modal sensors in various fields has generated large amounts of data, which are characterized by their high volume, variety, and integrity [5]. In the context of data, multi-modal refers to the presence of different forms of data representation or different formats within the same form [6]. Multi-modal data fusion is a technique that combines information obtained from multiple sensors, data sources, or different regions and angles into a consolidated presentation [7]. The integration of information from different sources reduces the uncertainty of individual data points and provides a more comprehensive and complete description of features. Recently, deep learning-based multi-modal data fusion methods have demonstrated the ability to map raw multi-modal data directly to outputs in an end-to-end manner. This approach not only simplifies the implementation of data fusion, but also effectively utilizes input data from different sensors, such as depth cameras and lasers. Nguyen et al. [8] proposed an autonomous navigation method for complex environments based on a deep multi-modal fusion network that fuses three visual modalities, namely, LiDAR data, RGB images, and point clouds, to enhance the perception ability of robots in complex environments. Zhang et al. [9] proposed a multi-modal object recognition method that fuses multiple deep learning models, significantly improving the accuracy of object recognition. Li et al. [10] proposed an adaptive fusion method for multi-source data that effectively and adaptively fuses multiple data sources and extracts valuable information based on the size of the convolution kernel. Poliyapram et al. [11] proposed an end-to-end deep neural network for the point-by-point fusion of LiDAR point clouds and aerial imagery, thereby improving the accuracy of three-dimensional (3D) segmentation and merging multi-view 3D scan data. Rosas-Cervantes et al. [12] proposed a multi-modal fusion approach for 3D robot trajectory estimation that combines two-dimensional (2D) features extracted from color images with 3D features extracted from point clouds. To achieve the more effective fusion of RGB images and point cloud features, Wu et al. [13] proposed a multi-layer fusion model that iteratively combines features from multiple convolutional layers and effectively fuses global and local features. Caltagirone et al. [14] proposed a deep learning-based road detection method that achieved outstanding performance by fusing a laser point cloud and camera images using a cross-fusion strategy. Zhu et al. [15] presented a novel 3D object detection algorithm based on cameras and LiDAR, which integrated point cloud features and image feature sampling; they further introduced a multi-head attention mechanism for feature fusion. Zhang et al. [16] used the channel attention mechanism with data from different sensors to assign reasonable fusion weights to multi-modal data in the fusion of feature channels, thereby better exploiting the correlation and complementarity of different modal data. Kang et al. [17] proposed a visual sensing and perception strategy based on LiDAR and camera fusion for the accurate localization of robots in real orchard environments. Li et al. [18] proposed a multi-sensor data fusion method that combines LiDAR and image data to achieve precise geometric alignment between LiDAR and image pixels; the fusion process employs a cross-attention mechanism to dynamically capture the correlation between images and LiDAR features to solve the 3D detection problem in autonomous driving. To achieve their objectives, the various multi-modal data fusion methods mentioned above not only explore data fusion models, but also integrate and modify existing models based on the characteristics and requirements of their own data fusion tasks.
In 3D reconstruction tasks, data from a single sensor or data source often exhibit limitations in terms of accuracy and precision. To overcome these limitations, this paper proposes a multi-sensor data fusion method based on a self-attention mechanism. The multi-sensor data fusion model constructed in this study was used to obtain multi-source and multi-modal data. The obtained data were treated as inputs to convolutional neural networks of different dimensions for feature extraction. Subsequently, all the modal features were concatenated using fully connected layers, and a self-attention mechanism was employed to adjust the importance of each input feature, resulting in a fused feature representation. Finally, a multi-layer perceptron (MLP) was utilized to generate the final output. By effectively integrating and maximally utilizing information from different sensors and data sources during the 3D reconstruction process, the method improves the accuracy, precision, and robustness of the final fusion and reconstruction results.
The remainder of this paper is organized as follows: Section 2 introduces the constructed multi-sensor data fusion model, including the sensor system used and its workflow. Section 3 provides a detailed description of the neural network architecture used for multi-source and multi-modal data fusion based on the self-attention mechanism. This section also discusses the modules for point cloud feature extraction, image feature extraction, and feature fusion. Section 4 presents the experimental results of the neural network, including the dataset used, evaluation metrics, parameter settings, and performance comparisons. Finally, Section 5 concludes the paper.

2. Model

Multi-sensor data fusion is a critical field aimed at utilizing information from multiple sensor systems to provide more accurate and complete outputs using data-fusion techniques. In this context, the proposed fusion model shown in Figure 1 integrates data from a binocular vision measurement system, laser tracking system, and depth camera system to generate high-precision point clouds as the output. The process of acquiring a point cloud using multiple sensors often involves the challenging task of achieving a unified coordinate system, which requires complex treatments such as coordinate transformation and registration. However, the proposed multi-sensor data fusion method does not require a unified coordinate system, simplifying the data preprocessing steps and providing a more efficient solution.
The workflow of the multi-sensor data fusion model begins by scanning the target object using a binocular vision measurement system. This step captures two types of point cloud data: low precision and high precision. A low-precision point cloud provides the overall structural information of the target object to the network; however, its accuracy is limited by noise and distortion. A high-precision point cloud serves as labeled data to guide the learning and optimization of the network to achieve the desired output. Subsequently, a laser tracking system is employed to measure the control points on the target object. The laser tracking system provides highly accurate position information for the high-precision control points used to calibrate and enhance the point cloud. Simultaneously, the depth camera system captures both RGB and depth images of the target object. RGB images provide color and texture information about the target object, while depth images provide distance and depth information. The image data provide richer visual information to the model, enabling a better understanding of the appearance and shape features of the target object.
All of these data, including the low-precision point cloud, high-precision control points, RGB images, and depth images, were preprocessed accordingly and input into the constructed convolutional neural network with self-attention (CNN-SA) multi-modal data fusion network. The purpose of this network is to learn how to effectively fuse multi-modal data from different sensor systems, with high-precision point clouds as the output. By training the neural network, the model learns the correlations and weights between the data, thereby maximizing the utilization of the information provided by each sensor system.
The multi-sensor data fusion model overcomes the limitations of a single-sensor system and maximizes the quality of the point clouds. The multi-modal data fusion network, which is the core component of the multi-sensor data fusion model, plays an important role in integrating and extracting information from different modalities, and the various constituent modules in the network will be introduced in detail in the next section.

3. Method

This section describes the three designed multi-modal modules for constructing the CNN-SA: the point cloud feature extraction module, the image feature extraction module, and the feature fusion module. Finally, we introduce the network architecture of CNN-SA.

3.1. Point Cloud Module

The point cloud feature extraction module in the CNN-SA utilizes three one-dimensional (1D) convolutional layers to extract features from the point cloud. To avoid offsets and bias, batch normalization layers were applied to normalize the outputs of the 1D convolutional layers, ensuring that the outputs had similar distributions. Subsequently, an activation function was applied for nonlinear transformation, processing the normalized results to retain positive values and enhance the nonlinear expressive capability of the network. The structure of this module is illustrated in Figure 2, where $P \in \mathbb{R}^{n \times 3}$ represents the 3D point cloud data, $n$ denotes the number of points, and $F_p \in \mathbb{R}^{256 \times n}$ represents the point cloud features.
The 3D point cloud was input into the point cloud feature extraction module and underwent three sets of nonlinear mapping operations:
$F_1 = \mathrm{ReLU}(\mathrm{BN}(f_1(P)))$
$F_2 = \mathrm{ReLU}(\mathrm{BN}(f_1(F_1)))$
$F_p = \mathrm{ReLU}(\mathrm{BN}(f_1(F_2)))$
where $f_1$ is a standard 1D convolution operation with a kernel size of one and a stride of one. The convolution operation does not alter the size of the feature map; however, the number of channels changes from 3 to 256. $\mathrm{BN}$ denotes batch normalization, which normalizes the input over each batch to stabilize the data distribution. $\mathrm{ReLU}$ denotes the activation function, which maps the input values to non-negative values.
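To make this concrete, the following is a minimal PyTorch sketch of the point cloud branch as described above; it is an illustration rather than the authors' released code, and the intermediate channel widths of 64 and 128 are taken from Table 1, since the text only states that the channel count grows from 3 to 256.

```python
import torch.nn as nn

class PointCloudEncoder(nn.Module):
    """Point cloud branch: three (1D conv -> batch norm -> ReLU) blocks,
    expanding the channel count 3 -> 64 -> 128 -> 256 (widths per Table 1)."""
    def __init__(self):
        super().__init__()
        self.block1 = nn.Sequential(nn.Conv1d(3, 64, kernel_size=1, stride=1),
                                    nn.BatchNorm1d(64), nn.ReLU())
        self.block2 = nn.Sequential(nn.Conv1d(64, 128, kernel_size=1, stride=1),
                                    nn.BatchNorm1d(128), nn.ReLU())
        self.block3 = nn.Sequential(nn.Conv1d(128, 256, kernel_size=1, stride=1),
                                    nn.BatchNorm1d(256), nn.ReLU())

    def forward(self, points):      # points: (batch, 3, n)
        f1 = self.block1(points)    # F1
        f2 = self.block2(f1)        # F2
        return self.block3(f2)      # F_p: (batch, 256, n)
```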

3.2. Image Module

The structure of the image feature extraction module in the CNN-SA is illustrated in Figure 3, where $X \in \mathbb{R}^{c \times h \times w}$ represents the image data, $c$ denotes the number of channels, $h \times w$ denotes the size of the image, and $F_x \in \mathbb{R}^{256 \times 1}$ represents the image features.
The image data serve as the input to the image feature extraction module, which first undergoes three sets of nonlinear mapping operations:
$F_1 = \mathrm{MaxPool}(\mathrm{ReLU}(f_{3 \times 3}(X)))$
$F_2 = \mathrm{MaxPool}(\mathrm{ReLU}(f_{3 \times 3}(F_1)))$
$F_3 = \mathrm{MaxPool}(\mathrm{ReLU}(f_{3 \times 3}(F_2)))$
where $f_{3 \times 3}$ is a standard 2D convolution operation with a kernel size of $3 \times 3$ and a stride of one, with zero-valued pixels padded around the input tensor. $\mathrm{MaxPool}$ represents the max-pooling layer, with a pooling kernel size of $2 \times 2$ and a stride of two; it downsamples the feature map, halving its spatial size while preserving important features. The three sets of nonlinear mapping operations alter the size and the number of channels of the feature map. Subsequently, a linear mapping is performed:
$F_x = \mathrm{Linear}(\mathrm{Flatten}(F_3))$
where $\mathrm{Flatten}$ denotes the flatten layer, which flattens the tensor obtained from the nonlinear mapping operations into a 1D tensor for input to the fully connected ($\mathrm{Linear}$) layer. The fully connected layer transforms the tensor into another 1D tensor containing 256 features, $F_x$, as the final output.
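A corresponding PyTorch sketch of the image branch is given below; the channel widths 16/32/64 and the 50,176 → 256 fully connected layer follow Table 1, a 224 × 224 input (Section 4.3) is assumed, and in_channels is chosen per image type (e.g., 3 for RGB, 1 for depth).

```python
import torch
import torch.nn as nn

class ImageEncoder(nn.Module):
    """Image branch: three (3x3 conv -> ReLU -> 2x2 max-pool) blocks, then
    flatten and a fully connected layer producing 256 features (per Table 1)."""
    def __init__(self, in_channels=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, stride=1, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
        )
        # a 224x224 input becomes 64 x 28 x 28 = 50,176 features after three poolings
        self.fc = nn.Linear(64 * 28 * 28, 256)

    def forward(self, image):                            # image: (batch, in_channels, 224, 224)
        x = self.features(image)
        return self.fc(torch.flatten(x, start_dim=1))    # F_x: (batch, 256)
```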

3.3. Fusion Module

The feature fusion module in the CNN-SA was built based on the self-attention mechanism, and its structure is illustrated in Figure 4. First, the four feature vectors obtained from the point cloud and image feature extraction modules were concatenated: the point cloud feature vector $F_p^{\mathrm{pcd}}$, the control point feature vector $F_p^{\mathrm{point}}$, the RGB image feature vector $F_x^{\mathrm{rgb}}$, and the depth image feature vector $F_x^{\mathrm{depth}}$. After concatenation, the resulting feature vector $F_{\mathrm{concat}}$ undergoes a transpose operation along the specified dimension, yielding $F_{\mathrm{concat}}^{\mathrm{tran}}$. Finally, the concatenated and transposed feature vector was input into the self-attention mechanism module, enabling the network to adaptively adjust the importance of each input feature and enhance its overall expressive power. It is worth noting that the order in which the feature vectors are concatenated does not affect the final result.
The input features $F_{\mathrm{concat}}$ and $F_{\mathrm{concat}}^{\mathrm{tran}}$ for the self-attention mechanism were calculated as follows, where $\mathrm{concat}$ represents the concatenation operation along the specified channel dimension and $\mathrm{transpose}$ represents the transpose operation along the specified dimension:
$F_{\mathrm{concat}} = \mathrm{concat}(F_p^{\mathrm{pcd}}, F_p^{\mathrm{point}}, F_x^{\mathrm{rgb}}, F_x^{\mathrm{depth}})$
$F_{\mathrm{concat}}^{\mathrm{tran}} = \mathrm{transpose}(F_{\mathrm{concat}})$
During forward propagation, the attention scores, which indicate the importance of each element of the input sequence, are calculated as follows, where $\mathrm{weight}_{\mathrm{init}}$ and $\mathrm{bias}_{\mathrm{init}}$ represent the initialized weights and bias terms, respectively:
$\mathrm{scores} = (F_{\mathrm{concat}}^{\mathrm{tran}} \times \mathrm{weight}_{\mathrm{init}}) + \mathrm{bias}_{\mathrm{init}}$
The application of Softmax to normalize the attention scores yields attention weights, which ensure that the elements of the feature vectors are within the range [0, 1] along the specified dimension and that they sum up to 1. The calculation is as follows:
$\mathrm{weight}_{\mathrm{soft}} = \dfrac{\exp(\mathrm{scores}_i)}{\sum_{j} \exp(\mathrm{scores}_j)}$
The weighted self-attention vector is then obtained by multiplying $\mathrm{weight}_{\mathrm{soft}}$ with $F_{\mathrm{concat}}^{\mathrm{tran}}$ and summing along the specified dimension:
$\mathrm{weight} = \sum \left( \mathrm{weight}_{\mathrm{soft}} \times F_{\mathrm{concat}}^{\mathrm{tran}} \right)$
The final fused output feature $F_{\mathrm{output}}$ is calculated as follows; it combines the self-attention-weighted result with the original feature vectors:
$F_{\mathrm{output}} = F_{\mathrm{concat}} + \mathrm{weight}$
The feature fusion module, implemented based on the self-attention mechanism, fuses four different types of input features, which enables the network to utilize different types of feature information more effectively and improves multi-source multi-modal data fusion.
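The sketch below shows one way the fusion module could be implemented in PyTorch. The paper does not fully specify the tensor shapes or the dimensions used for the transpose and summation, so this reading treats the concatenated features as a 256-channel tensor with N feature columns contributed by the four modalities and uses a learned weight and bias to score each column; these shape choices are assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionFusion(nn.Module):
    """Self-attention fusion sketch: score each column of the transposed
    concatenated features, softmax-normalize the scores, compute the weighted
    sum, and add it back to the concatenated features (residual fusion)."""
    def __init__(self, feature_dim=256):
        super().__init__()
        # initialized weight and bias used to compute the attention scores
        self.weight_init = nn.Parameter(torch.randn(feature_dim, 1) * 0.01)
        self.bias_init = nn.Parameter(torch.zeros(1))

    def forward(self, f_pcd, f_point, f_rgb, f_depth):
        # each input: (batch, 256, n_i); concatenate along the column dimension
        f_concat = torch.cat([f_pcd, f_point, f_rgb, f_depth], dim=2)   # (batch, 256, N)
        f_tran = f_concat.transpose(1, 2)                               # (batch, N, 256)
        scores = f_tran @ self.weight_init + self.bias_init             # (batch, N, 1)
        weight_soft = F.softmax(scores, dim=1)                          # attention over the N columns
        weight = (weight_soft * f_tran).sum(dim=1)                      # weighted sum: (batch, 256)
        return f_concat + weight.unsqueeze(-1)                          # residual add, broadcast over columns
```

Under this reading, the residual addition $F_{\mathrm{output}} = F_{\mathrm{concat}} + \mathrm{weight}$ broadcasts the single weighted vector across all feature columns.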

3.4. Network Architecture

CNN-SA is a multi-modal data fusion network that can simultaneously handle various types of input information from different sensor systems. The overall architecture of CNN-SA is illustrated in Figure 5 and mainly comprises the point cloud feature extraction module, image feature extraction module, self-attention mechanism, and multi-layer perceptron. The self-attention mechanism, among these components, serves as the feature fusion module.
The design and layer selection of the point cloud feature extraction module were primarily based on the characteristics of 3D point clouds and the advantages of CNNs. By employing three 1D convolutional layers, diverse features of 3D point clouds were extracted and downsampled while incorporating batch normalization operations and non-linear transformations through activation functions to effectively enhance the representation capacity of 3D point clouds, thus boosting the feature extraction capability of the entire network.
The image feature extraction module provides an effective and reliable approach to extracting image features from a network model. This module utilizes 2D convolutional and max-pooling layers to progressively compress input images and extract a series of features. By mapping these features to a vector space through fully connected layers, the module retains the original image information while effectively reducing feature dimensionality, thereby improving the computational efficiency and generalization ability of the model.
The self-attention mechanism can adaptively weigh the information at different positions in the input sequence to enable more effective utilization of the information in the input sequence during the input processing of the network, as well as to improve the network performance and accuracy. In addition, the self-attention mechanism can filter irrelevant data in the input sequence such that the network is more robust to noise and redundant information in the input sequence.
The multi-layer perceptron layer applies a series of linear and nonlinear transformations to the input features, and the outputs of this layer are mapped to the desired output dimension. By learning more complex and abstract representations in the feature space of the input data, the network achieved a better predictive performance.
The CNN-SA employs the point cloud feature module and the image feature module to process multi-source and multi-modal data and extract their features. It introduces a self-attention mechanism to weight and sum the features of different modalities, adaptively focusing on the importance of different modal data and enabling mutual support, complementarity, and correction among the multi-modal data. Finally, a multi-layer perceptron maps the fused features to the desired output size.
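For illustration, the modules sketched in the previous subsections could be assembled as follows. Whether the two point cloud inputs share an encoder and how the fused features are mapped to the output point cloud are not stated explicitly in the paper, so the separate encoders and the 256 → 128 → 64 → 3 convolutional head (taken from Table 1) should be read as assumptions.

```python
import torch.nn as nn

class CNNSA(nn.Module):
    """Illustrative CNN-SA assembly: per-modality encoders, self-attention
    fusion, and a 1D-convolutional head mapping the fused 256-channel features
    to 3D point coordinates (channel widths follow Table 1)."""
    def __init__(self):
        super().__init__()
        self.pcd_encoder = PointCloudEncoder()        # low-precision point cloud
        self.ctrl_encoder = PointCloudEncoder()       # high-precision control points
        self.rgb_encoder = ImageEncoder(in_channels=3)
        self.depth_encoder = ImageEncoder(in_channels=1)
        self.fusion = SelfAttentionFusion(feature_dim=256)
        self.head = nn.Sequential(
            nn.Conv1d(256, 128, kernel_size=1), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, 64, kernel_size=1), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 3, kernel_size=1),
        )

    def forward(self, pcd, ctrl_points, rgb, depth):
        fused = self.fusion(
            self.pcd_encoder(pcd),                    # (batch, 256, n_pcd)
            self.ctrl_encoder(ctrl_points),           # (batch, 256, n_ctrl)
            self.rgb_encoder(rgb).unsqueeze(-1),      # (batch, 256, 1)
            self.depth_encoder(depth).unsqueeze(-1),  # (batch, 256, 1)
        )
        return self.head(fused)                       # (batch, 3, N) predicted coordinates
```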

4. Experiments

In this section, we first describe the dataset and network parameter settings used in the training experiments for the CNN-SA network. Then, we present an experimental analysis of the network’s performance metrics.

4.1. Dataset

The multi-modal dataset for the CNN-SA network was collected using a binocular vision measurement system, a laser tracking system, and a depth camera system, focusing on a wing model made of carbon fiber material. The dataset comprises low-precision point clouds, high-precision control points, RGB images, depth images, and high-precision point clouds. Low-precision point clouds provide approximate shape and structural information about the target object for the network, whereas high-precision control points provide highly accurate positional information. The RGB and depth images provide rich visual information about the target object, and high-precision point clouds serve as the target output for the network, providing more accurate 3D shape information.
As shown in Figure 6, we performed a series of pre-processing steps to prepare the dataset for network training. First, the RGB and depth images captured using the depth camera system were masked to only extract the parts containing the object, thus reducing the impact of background noise on the network. Second, the low-precision and high-precision point clouds underwent the same denoising and downsampling processes. They were then paired with high-precision control points, RGB images, and depth images to function as network inputs. Finally, the processed high-precision point clouds were used as the target output for the network, enabling an accurate reconstruction of the low-precision point clouds through training.
The dataset used for the CNN-SA network training was large, consisting of 5000 sets of samples. Each sample included low-precision point clouds, high-precision control points, RGB images, depth images, and high-precision point clouds. Measurements were conducted on different regions and angles of the same target, resulting in diverse datasets.
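As an illustration of how such paired samples might be organized for training, the sketch below wraps them in a PyTorch Dataset; the field names and in-memory layout are hypothetical, since the paper does not describe its storage format.

```python
import torch
from torch.utils.data import Dataset

class MultiModalWingDataset(Dataset):
    """One sample = low-precision point cloud, high-precision control points,
    masked RGB and depth images, and the high-precision point cloud label.
    Field names ('low_pcd', 'ctrl_points', ...) are illustrative."""
    def __init__(self, samples, image_transform=None):
        self.samples = samples                  # list of dicts with preloaded data
        self.image_transform = image_transform  # e.g., center crop + to-tensor

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        s = self.samples[idx]
        rgb, depth = s["rgb"], s["depth"]
        if self.image_transform is not None:
            rgb, depth = self.image_transform(rgb), self.image_transform(depth)
        return (torch.as_tensor(s["low_pcd"], dtype=torch.float32),      # (3, n) input cloud
                torch.as_tensor(s["ctrl_points"], dtype=torch.float32),  # (3, m) control points
                rgb, depth,
                torch.as_tensor(s["high_pcd"], dtype=torch.float32))     # label cloud
```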

4.2. Evaluation Metrics

The L1 Loss (absolute value loss) is used as a loss function to optimize the neural network model on the training set to reduce the possibility of overfitting, and is described as
$\mathrm{Loss} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$
where $y_i$ represents the true output for each sample and $\hat{y}_i$ represents the predicted output of the network for that sample.
The mean square error (MSE) is used as a loss function to measure the performance of the neural network on the test set.
$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
The root mean square error (RMSE) is used to assess the degree of fit of the network to the training and test sets and to quantify the average magnitude of the network's prediction error. It is defined as follows:
$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$
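These metrics translate directly into code; a minimal PyTorch sketch is shown below.

```python
import torch

def l1_loss(y_true, y_pred):
    """Mean absolute error, used as the training loss."""
    return torch.mean(torch.abs(y_true - y_pred))

def mse(y_true, y_pred):
    """Mean squared error, used to measure performance on the test set."""
    return torch.mean((y_true - y_pred) ** 2)

def rmse(y_true, y_pred):
    """Root mean squared error, used to assess the degree of fit."""
    return torch.sqrt(mse(y_true, y_pred))
```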

4.3. Parameter Settings

The network was built and the experiments were conducted on Windows 11 using PyTorch 2.0.0 and Python 3.10.9. The experimental environment used an NVIDIA GeForce RTX 3060 laptop GPU and 14 GB of RAM.
First, the masked RGB images ($480 \times 640$) and depth images ($480 \times 640$) were center-cropped to a size of $224 \times 224$. The cropped images were then converted to tensors as inputs to the network. The multi-modal dataset comprises 5000 sample sets, with 3000 sets used as the training set, 1000 sets as the test set, and the remaining 1000 sets as the validation set. The training and test set data were used for network training, whereas the validation set data were used solely for the final performance evaluation and did not participate in network training. The Adam optimization algorithm was employed to optimize the training parameters and weights. The batch size was set to eight, the initial learning rate was 0.001, and the learning rate was multiplied by 0.7 every 10 iterations. The number of epochs was set to 350, and the comparison results were computed as the average of five runs. The specific configurations of the network parameters are listed in Table 1.
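A sketch of this training configuration is shown below, reusing the CNNSA and MultiModalWingDataset sketches from earlier sections. Stepping the scheduler once per epoch (so the learning rate is multiplied by 0.7 every 10 epochs) is an interpretation of the stated schedule, and train_samples stands for the prepared training split.

```python
from torch import nn, optim
from torch.utils.data import DataLoader
from torchvision import transforms

# Image preprocessing from Section 4.3: center-crop the masked 480x640 images
# to 224x224 and convert them to tensors (applied inside the dataset).
image_transform = transforms.Compose([transforms.CenterCrop(224), transforms.ToTensor()])

train_set = MultiModalWingDataset(train_samples, image_transform=image_transform)
train_loader = DataLoader(train_set, batch_size=8, shuffle=True)

model = CNNSA()
criterion = nn.L1Loss()                                    # L1 training loss (Section 4.2)
optimizer = optim.Adam(model.parameters(), lr=0.001)       # Adam, initial learning rate 0.001
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.7)

for epoch in range(350):                                   # 350 epochs
    for pcd, ctrl, rgb, depth, target in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(pcd, ctrl, rgb, depth), target)
        loss.backward()
        optimizer.step()
    scheduler.step()                                       # learning rate x0.7 every 10 scheduler steps
```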

4.4. Performance Comparison

To validate the effectiveness of CNN-SA, the proposed multi-source and multi-modal data fusion network based on the self-attention mechanism, it was compared with several networks with related architectures. The comparison models included CNN, CNN-CBAM [19], and CNN-LSTM [20], all of which had parallel network structures; the CNN served as the baseline for comparison. Figure 7 illustrates the overall RMSE and loss function curves for the different networks on the test set. In Figure 7a, the RMSE of each network decreases initially and then stabilizes, with CNN-SA exhibiting the best accuracy and stability. Figure 7b presents the loss function curves for all networks, where CNN-SA demonstrates the fastest convergence and the lowest optimal value, further confirming the effectiveness of the proposed network.
To further analyze the performance of each network, the deviations between each network's fusion results and the labeled data were computed on the validation set samples and plotted as box plots. The box plot in Figure 8 shows the distribution of these deviation values for each network.
These results clearly show the advantages of the CNN-SA network. With a more compact distribution of deviation values and a lower median, it can more accurately predict point cloud data that match the labeled data, further validating the effectiveness and superiority of the CNN-SA model in the point cloud reconstruction task.
To provide a more intuitive comparison of the performance of each network in point cloud reconstruction tasks, a sample from the validation set was selected and input into each network for fusion. As shown in Figure 9, the fusion results of each network were compared with the label data, and deviation maps between the fusion results and the label data were plotted. It can be observed that the CNN-CBAM network exhibits significant deviations and distortions in the data fusion task. The fusion results of the CNN network were slightly better than those of the CNN-CBAM network, but still exhibited some errors. In comparison, the fusion results of the CNN-LSTM network were markedly improved, with smaller errors. Compared with the other networks, CNN-SA demonstrated higher point cloud matching and shape consistency, with smaller errors and smoother edges.
By observing the deviation maps, it can be concluded that the CNN-SA network outperforms the other networks in multi-source and multi-modal data fusion tasks. It is capable of more accurately predicting point clouds consistent with label data while reducing prediction errors and shape distortions.

5. Conclusions

The proposed multi-sensor data fusion method based on the self-attention mechanism builds a novel multi-sensor data fusion model that effectively fuses and collaboratively learns data from binocular vision measurement, laser tracking, and depth camera systems. A multi-source and multi-modal data fusion network, CNN-SA, served as the core component of the multi-sensor data fusion model. By incorporating the self-attention mechanism, CNN-SA could adaptively learn the correlations and weights between different modalities, thus improving the accuracy of data fusion by better utilizing multi-sensor data. A significant advantage of the CNN-SA network was its ability to handle data without a unified coordinate system. Furthermore, compared with the input data, the CNN-SA network achieved an accuracy improvement of 72.6%, surpassing the improvements of 29.9% for CNN-CBAM, 23.6% for CNN, and 11.4% for CNN-LSTM. These results demonstrate the significant accuracy improvement achieved by the CNN-SA network in multi-source and multi-modal data fusion tasks, providing strong support for further research and applications in the field of multi-sensor data fusion processing. In future work, applying a metaheuristic optimization algorithm to the proposed CNN-SA neural network is expected to improve the efficiency of multi-sensor data fusion.

Author Contributions

Conceptualization, X.L. and D.Y.; methodology, S.C. and D.Y.; software, S.C.; validation, S.C.; formal analysis, S.C. and D.Y.; investigation, S.C.; resources, X.L.; data curation, X.L.; writing—original draft preparation, S.C.; writing—review and editing, S.C.; visualization, S.C.; supervision, X.L., D.Y., L.G., Y.L. and L.L.; project administration, X.L. and S.C.; funding acquisition, X.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Key Research and Development Project of the Jilin Province Science and Technology Development Program [No. 20200401019GX] and the Zhongshan Social Public Welfare Science and Technology Research Project [No. 2022B2013].

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available because they are intended only for internal use within the laboratory.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

  1. Zhao, L.; Zhang, H.; Mbachu, J. Multi-Sensor Data Fusion for 3D Reconstruction of Complex Structures: A Case Study on a Real High Formwork Project. Remote Sens. 2023, 15, 1264. [Google Scholar] [CrossRef]
  2. Cui, Y.; Li, Q.; Yang, B.; Xiao, W.; Chen, C.; Dong, Z. Automatic 3-D reconstruction of indoor environment with mobile laser scanning point clouds. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2019, 12, 3117–3130. Available online: https://ieeexplore.ieee.org/abstract/document/8736013 (accessed on 12 June 2019). [CrossRef]
  3. Kang, H.; Wang, X. Semantic segmentation of fruits on multi-sensor fused data in natural orchards. Comput. Electron. Agric. 2023, 204, 107569. [Google Scholar] [CrossRef]
  4. Zhang, Q.; Kang, S.; Yin, C.; Li, Z.; Shi, Y. An Adaptive Learning Method for the Fusion Information of Electronic Nose and Hyperspectral System to Identify the Egg Quality. Sens. Actuators A Phys. 2022, 346, 113824. [Google Scholar] [CrossRef]
  5. Tang, Q.; Liang, J.; Zhu, F. A Comparative Review on Multi-Modal Sensors Fusion Based on Deep Learning. Signal Process. 2023, 213, 109165. [Google Scholar] [CrossRef]
  6. Lin, J.; Li, T.; Xie, P.; Du, S.; Teng, F.; Yang, X. Urban Big Data Fusion Based on Deep Learning: An Overview. Inf. Fusion 2020, 53, 123–133. [Google Scholar] [CrossRef]
  7. Weckenmann, A.; Jiang, X.; Sommer, K.-D.; Neuschaefer-Rube, U.; Seewig, J.; Shaw, L.; Estler, T. Multisensor Data Fusion in Dimensional Metrology. CIRP Ann. 2009, 58, 701–721. [Google Scholar] [CrossRef]
  8. Nguyen, A.; Nguyen, N.; Tran, K.; Tjiputra, E.; Tran Quang, D. Autonomous navigation in complex environments with deep multimodal fusion network. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24 October 2020–24 January 2021; pp. 5824–5830. [Google Scholar] [CrossRef]
  9. Zhang, W.; Zhang, Y.; Zhai, J.; Zhao, D.; Xu, L.; Zhou, J.; Li, Z.; Yang, S. Multi-Source Data Fusion Using Deep Learning for Smart Refrigerators. Comput. Ind. 2018, 95, 15–21. [Google Scholar] [CrossRef]
  10. Li, S.; Wang, H.; Song, L.; Wang, P.; Cui, L.; Lin, T. An Adaptive Data Fusion Strategy for Fault Diagnosis Based on the Convolutional Neural Network. Measurement 2020, 165, 108122. [Google Scholar] [CrossRef]
  11. Poliyapram, V.; Wang, W.; Nakamura, R. A Point-Wise LiDAR and Image Multimodal Fusion Network (PMNet) for Aerial Point Cloud 3D Semantic Segmentation. Remote Sens. 2019, 11, 2961. [Google Scholar] [CrossRef]
  12. Rosas-Cervantes, V.; Hoang, Q.D.; Woo, S.; Lee, S.G. Mobile Robot 3D Trajectory Estimation on a Multilevel Surface with Multimodal Fusion of 2D Camera Features and a 3D Light Detection and Ranging Point Cloud. Int. J. Adv. Robot. Syst. 2022, 19, 17298806221089198. [Google Scholar] [CrossRef]
  13. Wu, Y.; Jiang, X.; Fang, Z.; Gao, Y.; Fujita, H. Multi-Modal 3D Object Detection by 2D-Guided Precision Anchor Proposal and Multi-Layer Fusion. Appl. Soft Comput. 2021, 108, 107405. [Google Scholar] [CrossRef]
  14. Caltagirone, L.; Bellone, M.; Svensson, L.; Wahde, M. LIDAR–Camera Fusion for Road Detection Using Fully Convolutional Neural Networks. Robot. Auton. Syst. 2019, 111, 125–131. [Google Scholar] [CrossRef]
  15. Zhu, Y.; Xu, R.; An, H.; Tao, C.; Lu, K. Anti-Noise 3D Object Detection of Multimodal Feature Attention Fusion Based on PV-RCNN. Sensors 2023, 23, 233. [Google Scholar] [CrossRef] [PubMed]
  16. Zhang, X.; Li, Z.; Gao, X.; Jin, D.; Li, J. Channel Attention in LiDAR-Camera Fusion for Lane Line Segmentation. Pattern Recognit. 2021, 118, 108020. [Google Scholar] [CrossRef]
  17. Kang, H.; Wang, X.; Chen, C. Accurate fruit localisation using high resolution LiDAR-camera fusion and instance segmentation. Comput. Electron. Agric. 2022, 203, 107450. [Google Scholar] [CrossRef]
  18. Li, Y.; Yu, A.W.; Meng, T.; Caine, B.; Ngiam, J.; Peng, D.; Shen, J.; Lu, Y.; Zhou, D.; Le, Q.V.; et al. DeepFusion: Lidar-Camera Deep Fusion for Multi-Modal 3D Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 17182–17191. Available online: https://openaccess.thecvf.com/content/CVPR2022/html/Li_DeepFusion_Lidar-Camera_Deep_Fusion_for_Multi-Modal_3D_Object_Detection_CVPR_2022_paper.html (accessed on 24 June 2022).
  19. Vanian, V.; Zamanakos, G.; Pratikakis, I. Improving Performance of Deep Learning Models for 3D Point Cloud Semantic Segmentation via Attention Mechanisms. Comput. Graph. 2022, 106, 277–287. [Google Scholar] [CrossRef]
  20. Huang, R.; Zhang, W.; Kundu, A.; Pantofaru, C.; Ross, D.A.; Funkhouser, T.; Fathi, A. An LSTM Approach to Temporal 3D Object Detection in LiDAR Point Clouds. In Proceedings of the 16th European Conference, Computer Vision-ECCV 2020, Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 266–282. [Google Scholar] [CrossRef]
Figure 1. Multi-sensor data fusion model.
Figure 2. Point cloud feature extraction module.
Figure 3. Image feature extraction module.
Figure 4. Feature fusion module.
Figure 5. The overall architecture of the CNN-SA.
Figure 6. The flow chart for data management.
Figure 7. Results. (a) Test RMSE, (b) test loss.
Figure 8. Performances of different fusion networks.
Figure 9. Deviation map. (a) CNN-CBAM deviation map, (b) CNN deviation map, (c) CNN-LSTM deviation map, and (d) CNN-SA deviation map.
Table 1. Network-specific parameter configuration.

Network Layer | Network Type | Number of Neurons | Kernel Size | Stride | Activation Function
1 | Convolutional Layer | 16 | (3, 3) | 1 | ReLU
2 | Max-Pooling Layer | - | (2, 2) | 2 | -
3 | Convolutional Layer | 32 | (3, 3) | 1 | ReLU
4 | Max-Pooling Layer | - | (2, 2) | 2 | -
5 | Convolutional Layer | 64 | (3, 3) | 1 | ReLU
6 | Max-Pooling Layer | - | (2, 2) | 2 | -
7 | Flatten Layer | - | - | - | -
8 | Linear Layer | 256 | (50,176, 256) | - | -
9 | Convolutional Layer | 64 | 1 | 1 | -
10 | Batch Normalization Layer | 64 | - | - | ReLU
11 | Convolutional Layer | 128 | 1 | 1 | -
12 | Batch Normalization Layer | 128 | - | - | ReLU
13 | Convolutional Layer | 256 | 1 | 1 | -
14 | Batch Normalization Layer | 256 | - | - | ReLU
15 | Self-Attention Layer | 256 | - | - | -
16 | Linear Layer | 1036 | - | - | -
17 | Convolutional Layer | 128 | 1 | 1 | -
18 | Batch Normalization Layer | 128 | - | - | ReLU
19 | Convolutional Layer | 64 | 1 | 1 | -
20 | Batch Normalization Layer | 64 | - | - | ReLU
21 | Convolutional Layer | 3 | 1 | 1 | -
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
