3.4. MN-SIFT
To compute MN-SIFT [17] descriptors, a circular region around each detected SIFT feature point is cropped from the image. The radius of the region is proportional to the scale ($\sigma$) of the SIFT feature point. The region is subdivided into location bins as shown in Figure 2. The bins are denoted as $b_k$, where $k = 1, 2, \ldots, K$. The region is convolved with derivative kernels to obtain the derivatives $G_x$ and $G_y$ along the horizontal and vertical directions, respectively. The gradient magnitude ($m$) and gradient orientation ($\theta$) are calculated at each pixel location $(x, y)$ as:
$$m(x, y) = \sqrt{G_x(x, y)^2 + G_y(x, y)^2}, \qquad \theta(x, y) = \arctan\!\left(\frac{G_y(x, y)}{G_x(x, y)}\right)$$
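The gradient computation above can be sketched in Python; the exact derivative kernels are not specified in this section, so a simple $[-1, 0, 1]$ kernel pair is assumed for illustration:

```python
import numpy as np
from scipy.ndimage import convolve

def region_gradients(region):
    """Gradient magnitude and orientation of a cropped region.

    The [-1, 0, 1] derivative kernels are an assumption; the paper's
    exact directional kernels may differ.
    """
    kx = np.array([[-1.0, 0.0, 1.0]])   # horizontal derivative kernel
    ky = kx.T                           # vertical derivative kernel
    gx = convolve(region.astype(float), kx)
    gy = convolve(region.astype(float), ky)
    m = np.hypot(gx, gy)                # gradient magnitude
    theta = np.arctan2(gy, gx)          # orientation in (-pi, pi]
    return m, theta
```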
Then the modified gradient magnitudes ($\tilde{m}$) are computed as:
$$\tilde{m}(x, y) = \frac{m(x, y) - m_{\min}}{m_{\max} - m_{\min}}$$
where $m_{\min}$ and $m_{\max}$ are the region's minimum and maximum gradient magnitude values, respectively. The pixels of each location bin are identified by their spatial positions within the $s \times s$ region, where $s$ represents the region's size; the pixels of a location bin $b_k$ are denoted as $p_k$. The gradient orientations $\theta$ of the region are quantized into eight different levels as follows:
$$q(x, y) = \left\lfloor \frac{\theta(x, y)}{\pi/4} \right\rfloor \bmod 8$$
where $\bmod$ represents the modular operator, i.e., the modular operation keeps the quantized levels within $0, 1, \ldots, 7$. Then a feature histogram $h_k$ is computed for each location bin as follows:
$$h_k(l) = \sum_{(x, y) \in b_k} \tilde{m}(x, y)\, \delta\big(q(x, y), l\big)$$
where $l = 0, 1, \ldots, 7$ and $\delta$ is defined as:
$$\delta(a, b) = \begin{cases} 1, & a = b \\ 0, & a \neq b \end{cases}$$
The histograms are concatenated over all the location bins to obtain the MN-SIFT descriptor.
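Assuming the region is split into a 4 × 4 grid of location bins (the grid size and the exact quantization rule are assumptions for illustration), the normalization, quantization, and histogram steps above can be sketched as:

```python
import numpy as np

def mn_sift_histograms(mag, theta, grid=4):
    """Min-max normalize magnitudes, quantize orientations into eight
    levels, and build one 8-bin histogram per location bin.

    The 4x4 grid and the quantization rule are illustrative assumptions.
    """
    mod = (mag - mag.min()) / (mag.max() - mag.min() + 1e-12)
    # Eight orientation levels; mod 8 keeps the levels in 0..7.
    q = np.floor((theta + np.pi) / (np.pi / 4)).astype(int) % 8
    s = mag.shape[0]
    step = s // grid
    hists = []
    for by in range(grid):
        for bx in range(grid):
            mb = mod[by*step:(by+1)*step, bx*step:(bx+1)*step]
            qb = q[by*step:(by+1)*step, bx*step:(bx+1)*step]
            # Magnitude-weighted count of each orientation level.
            hists.append(np.bincount(qb.ravel(), weights=mb.ravel(),
                                     minlength=8))
    return np.concatenate(hists)   # grid*grid histograms of 8 bins each
```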
3.5. Regression Modeling Using Corresponding Descriptors
We trained a regression model on corresponding MN-SIFT descriptors of the training set. To understand the training process, let $I_{VS}$ and $I_{IR}$ be two images of the same scene. These images depict the same scene contents in the VS and IR bands, respectively. Feature points are detected on the $I_{VS}$ and $I_{IR}$ images and the feature point locations (pixels) are stored as $P_{VS} = \{p_1, p_2, \ldots, p_u\}$ and $P_{IR} = \{p_1, p_2, \ldots, p_v\}$, respectively, where $u$ and $v$ represent the total numbers of feature points detected on the $I_{VS}$ and $I_{IR}$ images, respectively. The feature points of $I_{VS}$ are projected onto $I_{IR}$ with a homography $K$, which acts as ground-truth data between $I_{VS}$ and $I_{IR}$. This homography is known in advance between every VS–IR image pair of the training set according to References [18,34]. We use a projection error of 2 pixels to identify corresponding feature points between $I_{VS}$ and $I_{IR}$.
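The identification of corresponding feature points through the ground-truth homography can be sketched as follows; the function and variable names are illustrative, not from the paper:

```python
import numpy as np

def corresponding_points(pts_vs, pts_ir, K, tol=2.0):
    """Project (N, 2) VS feature points onto the IR image with the
    ground-truth homography K and keep the pairs whose projection
    error is at most `tol` pixels (2 pixels in the text)."""
    ones = np.ones((len(pts_vs), 1))
    proj = (K @ np.hstack([pts_vs, ones]).T).T   # homogeneous projection
    proj = proj[:, :2] / proj[:, 2:3]
    pairs = []
    for i, p in enumerate(proj):
        dist = np.linalg.norm(pts_ir - p, axis=1)
        j = int(np.argmin(dist))                 # nearest IR feature point
        if dist[j] <= tol:
            pairs.append((i, j))
    return pairs
```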
Figure 3 shows the detected and corresponding feature points as blue '+' and green 'o' markers, respectively, between the VS–LWIR images of an MSD scene.
Then MN-SIFT descriptors are computed for the corresponding feature points. Such descriptors are referred to as corresponding/correct descriptors. Let $d_{VS}$ be a descriptor of the $I_{VS}$ image and let its corresponding descriptor match be $d_{IR}$ in the $I_{IR}$ image:
$$d_{VS} = [d_{VS}(1), d_{VS}(2), \ldots, d_{VS}(L)], \qquad d_{IR} = [d_{IR}(1), d_{IR}(2), \ldots, d_{IR}(L)]$$
where $L$ represents the length of the $d_{VS}$ and $d_{IR}$ MN-SIFT descriptors. Let $f_1$ be a model function that gives an error $e_1$ when it is subtracted from the first element of $d_{IR}$:
$$e_1 = d_{IR}(1) - f_1(d_{VS})$$
where the parameters of $f_1$ are required to be learnt to minimize $e_1^2$, i.e., the square of the error between the first element of $d_{IR}$ and the model function $f_1$. To learn $f_1$, a projection error equal to or less than 2 pixels is used to identify $n$ corresponding MN-SIFT descriptors between the $I_{VS}$ and $I_{IR}$ images. The corresponding descriptors are then stored as $R$ and $T$ matrices, where $R_i$ and $T_i$ are corresponding MN-SIFT descriptors stored as the $i$th rows of $R$ and $T$, with $i = 1, 2, \ldots, n$. With the $n$ corresponding descriptors, $n$ errors are obtained as:
$$e_1^{(i)} = T_i(1) - f_1(R_i), \qquad i = 1, 2, \ldots, n$$
Here $f_1$ is learnt with the objective of minimizing $\sum_{i=1}^{n} \big(e_1^{(i)}\big)^2$, i.e., the sum of squared errors. Similarly, $f_2, f_3, \ldots, f_L$ are learnt with the same objective, to minimize the sums of squared errors of the second, third, …, $L$th descriptor elements, respectively.
The regression modeling, as explained above, was based on a single image pair, $I_{VS}$ and $I_{IR}$. In the case of a dataset, for instance the RGB-NIR and MSD datasets, each dataset is randomly divided into two disjoint sets: one set for training and the other one for testing. The corresponding descriptors are obtained from each image pair of the training set and are appended as rows to form the matrices $R$ and $T$. The SIFT detector gives on average 380 corresponding feature points per image pair with a projection error equal to or less than 2 pixels. If there are 10 image pairs in the training set, then the total number of corresponding descriptors (i.e., rows of $R$ and $T$) obtained is about 3800, and the unknown parameters of the model functions are learnt on these corresponding descriptors as explained above.
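A minimal sketch of this training stage, assuming linear regression and synthetic stand-ins for the corresponding-descriptor matrices $R$ and $T$ (real MN-SIFT descriptors would be used in practice):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-ins: each row of R is a VS-band descriptor and the
# matching row of T an IR-band descriptor (random data, for shape only).
rng = np.random.default_rng(0)
n, length = 200, 16                    # n descriptor pairs, length L
R = rng.random((n, length))
T = R @ rng.random((length, length))   # IR rows depend linearly on VS rows

# One model function f_j per descriptor element: f_j maps a whole VS
# descriptor to the j-th IR element, minimizing the sum of squared errors.
models = [LinearRegression().fit(R, T[:, j]) for j in range(length)]

# Stacking the per-element predictions reconstructs IR-like descriptors.
pred = np.column_stack([f.predict(R) for f in models])
```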
These learnt parameters constitute the parameters of a regression model. In this paper we use five different regression models to compute the proposed Reg-SIFT. These regression models are Linear Regression (LR), Decision Tree Regression (DTR), Random Forest Regression (RFR), Support Vector Machine Regression (SVMR) and Multi-Layer Perceptron Regression (MLPR). All these models are implemented in Python with the Sklearn library, except the MLPR, which is implemented using the Keras and TensorFlow libraries.
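The five regressors can be instantiated as follows; sklearn's MLPRegressor stands in here for the paper's Keras/TensorFlow MLP, and the hyperparameters (defaults) are assumptions:

```python
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor  # stand-in for the Keras MLP

# Default hyperparameters; the paper's exact settings are not given here.
regressors = {
    "LR": LinearRegression(),
    "DTR": DecisionTreeRegressor(),
    "RFR": RandomForestRegressor(),
    "SVMR": SVR(),
    "MLPR": MLPRegressor(),
}
```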
Figure 1 shows the Reg-SIFT block, which is obtained by processing the MN-SIFT descriptors of the test VS images with the trained regression model. We are using five different models; therefore, Reg ∈ {LR, DTR, RFR, SVMR, MLPR}. To understand the processing of MN-SIFT to get Reg-SIFT, consider $I_{VS}$ and $I_{IR}$ to be two images of the test set depicting the same scene in the VS and IR bands, respectively. SIFT feature points are detected and MN-SIFT descriptors are computed. Let $D_{VS}$ and $D_{IR}$ be the two sets of MN-SIFT descriptors that belong to the $I_{VS}$ and $I_{IR}$ images, respectively, where $D_{VS} = \{d_1, d_2, \ldots, d_w\}$ and $D_{IR} = \{d_1, d_2, \ldots, d_z\}$. The total numbers of descriptors computed for the $I_{VS}$ and $I_{IR}$ images are denoted as $w$ and $z$, respectively. Each $D_{VS}$ descriptor is processed through the regression model (i.e., testing) to obtain a Reg-SIFT descriptor, and then the Reg-SIFT descriptors are matched with the MN-SIFT descriptors ($D_{IR}$) of image $I_{IR}$. To understand the process, let $d_{VS}$ be an MN-SIFT descriptor, which is converted into a Reg-SIFT descriptor $\hat{d}_{VS}$ as follows:
$$\hat{d}_{VS} = \big[f_1(d_{VS}), f_2(d_{VS}), \ldots, f_L(d_{VS})\big]$$
After that, image matching using Reg-SIFT is carried out and the matching results are compared with those of state-of-the-art descriptors.
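The test-time conversion and matching can be sketched as below, assuming per-element model functions with a `predict` method; nearest-Euclidean-neighbour matching is an illustrative choice, not necessarily the paper's criterion:

```python
import numpy as np

def reg_sift_match(D_vs, D_ir, models):
    """Convert each VS MN-SIFT descriptor to a Reg-SIFT descriptor with
    the trained model functions, then match it to the nearest IR
    MN-SIFT descriptor (Euclidean distance, an assumed criterion)."""
    # Element j of each Reg-SIFT descriptor comes from model function f_j.
    reg = np.column_stack([f.predict(D_vs) for f in models])
    matches = []
    for r in reg:
        dist = np.linalg.norm(D_ir - r, axis=1)
        matches.append(int(np.argmin(dist)))   # index of best IR descriptor
    return matches
```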
We use MN-SIFT for the proposed Reg-SIFT because of its better robustness towards intensity and textural changes. MN-SIFT is based on MN features, which contain both local textural and structural information, and it performs well in cross-spectral applications compared to the NG-SIFT of [9], since NG-SIFT encapsulates only the structural information [15]. The experimental results of [9] show that MN-SIFT demonstrates better performance on multisensor images than the SIFT, LC-SIFT, LBPG, DE-SIFT, and CS-LBP descriptors.
Another reason for choosing MN-SIFT is its descriptor construction process, which is simple compared to that of the EOH, LSS, MFD and HoDM descriptors. EOH uses Canny edges; the detection of Canny edges is relatively simple on VS images, but fails on LWIR/NIR images due to low contrast. LSS is based on local self-similarity between a small region and a larger one around the feature points and is computed with a sum of squared differences approach. MFD uses directional Log-Gabor filters, which are more computationally expensive than the simple directional filters used in MN-SIFT. HoDM uses Sobel filters in four different directions to compute image gradients. Then the absolute values of the gradients are calculated and the weak gradients are suppressed with a hypothesis. The four image gradient values at each pixel location are compared and binarized. The absolute and binary gradients are then read with a spatial pooling scheme to compute the HoDM descriptors. The HoDM process is also more computationally expensive than that of MN-SIFT.