Article

Rethink Motion Information for Occluded Person Re-Identification

School of Mechanical and Electrical Engineering, China Jiliang University, Hangzhou 310018, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(6), 2558; https://doi.org/10.3390/app14062558
Submission received: 16 January 2024 / Revised: 5 March 2024 / Accepted: 13 March 2024 / Published: 19 March 2024

Abstract

Person re-identification aims to identify the same pedestrians captured by various cameras from different viewpoints in multiple scenarios. Occlusion is the toughest problem for practical applications. In video-based ReID tasks, motion information can be easily obtained from the sampled frames and provides discriminative human part representations. However, most motion-based methodologies are designed for video frames and are not suitable for processing a single static image as input. In this paper, we propose a Motion-Aware Fusion (MAF) network that acquires motion information from static images in order to improve the performance of ReID tasks. Specifically, a visual adapter is introduced to enable visual feature extraction from either image or video data. We design a motion consistency task to guide the motion-aware transformer to learn representative human-part motion information and greatly improve the quality of the learned features of occluded pedestrians. Extensive experiments on popular holistic, occluded, and video datasets demonstrate the effectiveness of our proposed method. This method outperforms state-of-the-art approaches by improving the mean average precision (mAP) by 1.5% and rank-1 accuracy by 1.2% on the challenging Occluded-REID dataset. At the same time, it surpasses other methods on the MARS dataset with an improvement of 0.2% in mAP and 0.1% in rank-1 accuracy.

1. Introduction

Person re-identification (ReID) aims to identify the same pedestrians captured by a variety of cameras from different viewpoints and in various scenarios [1,2,3,4,5]. ReID has a wide range of real applications and can have a significant impact on a variety of industries. For example, by identifying and tracking individuals in public places, ReID can enhance public safety and potentially reduce crime rates. In retail, ReID can be used for customer traffic counts and behavioral analysis. By analyzing the recognition characteristics of pedestrians, traffic management authorities can better monitor the use of sidewalks and provide a safer traffic environment for pedestrians. In addition, ReID can be applied to traffic flow monitoring and congestion prediction.
The ReID system can be utilized in both image-based and video-based environments. Image-based techniques [6,7,8] aim to link still images, such as a single frame from a camera, of individuals captured by a network of non-overlapping cameras. On the other hand, video-based ReID [9,10,11] involves matching the input video tracklets of an individual against a collection of tracklet representations. Compared to image-based methods, video-based ReID benefits from the motion and spatio-temporal information provided by video data, allowing the system to identify a person’s body silhouette and distinctive human parts more effectively. Many video-based methods [12,13,14] also incorporate motion information to reduce the impact of background objects and address the issue of occlusion. It is worth noting that incorporating motion information from still images can further improve the handling of occlusions in challenging scenarios. While recent deep learning methods have produced satisfying retrieval performances in the main pedestrian regions, the problem of occlusions caused by diverse obstacles remains a challenge in real-world applications.
Compared to the general person re-identification (ReID) problem, the current challenge of occluded person ReID is two-fold. Firstly, interference from unknown objects can cause significant fluctuations in human features, leading to difficulties in feature extraction. To address this issue, previous methods [8,15] have employed targeted occlusion data enhancement or introduced a more robust pre-trained model. However, these approaches tend to focus more on the person’s appearance features and memorize specific occlusion types, resulting in a lack of robustness in the extracted human-part features. In real-world scenarios, occlusion types are often unpredictable and randomly located, making specific occlusion data augmentation limited in terms of its generalizability across the entire domain. Secondly, exploring more representative person features is crucial for the occluded person ReID framework. Multi-pedestrian occlusions [16,17] are particularly challenging compared to other types of occlusions. In these scenarios, the model’s ability to distinguish the features of different pedestrians becomes even more important [18]. However, relying solely on the external features of a pedestrian’s appearance is not sufficient. Implicit features are also needed for effective ReID.
To address these challenges, we propose a framework with three key components that deeply examine the implicit motion information and explicit visual characteristics of pedestrians. The first component handles the dynamic processing of image or video feature inputs, providing a unified architecture for both image-based and video-based ReID tasks. We introduce a visual adapter with a set of learnable visual queries to integrate visual features from the visual encoder and reduce the computational complexity of the cross-attention between motion and visual information. The adapter treats a single image as a single-frame video, enabling our framework to handle both image-based and video-based ReID tasks. The second component obtains implicit motion information through a motion-aware transformer. By passing the integrated visual features through the transformer, we establish another set of learnable queries to obtain per-segment human motion representations. We also design a motion consistency task to extract motion information from still images and continuously refine the motion representations, without relying on any pre-trained models. The final component fuses visual features and motion information using a standard vision transformer architecture. The fusion encoder learns the relationship between notable human parts and the per-segment human motion representations. To evaluate the effectiveness of our approach, we conduct experiments on both image-based (including occluded and holistic) and video-based ReID benchmarks. As shown in Figure 1, our proposed method achieves competitive results on both image- and video-based ReID tasks.
The main contributions in this paper can be summarized as follows:
  • A novel architecture is proposed to simultaneously deal with video-based and image-based ReID tasks.
  • We propose a motion-aware transformer and a motion consistency task to extract human motion information, which not only provides discriminative human representations but also alleviates the problem of similarly dressed pedestrians.
  • Sufficient experiments on several public video-based and image-based ReID datasets have demonstrated that our proposed framework outperforms the state-of-the-art methods.

2. Related Work

In this section, we first review the development of person ReID methods. As shown in Table 1, we provide a brief summary of some of the classic ReID methods. Then, we review motion-supervised segmentation methods, which inspire us to better utilize the motion information.

2.1. Image Person Re-Identification

Image person re-identification, covering both holistic and occluded settings, aims to retrieve a person of interest across other camera views. With the creation of large-scale datasets and the development of deep learning methods, recent works that utilize transformers to obtain refined human features have achieved the best performance on holistic person ReID tasks. Li et al. [7] proposed a Part-Aware Transformer to deal with occlusion situations, which utilizes a transformer encoder–decoder architecture with learnable part prototypes for occluded person ReID and achieves a competitive performance. Wang et al. [8] proposed a feature diffusion model, including a non-pedestrian occlusion augmentation strategy, an occlusion erasing module, and a feature diffusion module, to help the model distinguish diverse occluded situations and precisely perceive target pedestrians. Tan et al. [6] designed a dynamic prototype mask for occluded person ReID, which does not rely on extra pre-trained networks but uses a hierarchical mask generator to enrich the holistic prototype, simultaneously retaining the information from the whole image and achieving automatic alignment. Although these methods achieve satisfactory performance on the holistic or occluded ReID benchmarks, some occlusion-based models are still affected by unpredictable occlusions and remain largely influenced by pedestrians’ appearance.

2.2. Video Person Re-Identification

Compared to image data, the additional temporal relations in videos effectively alleviate many issues, such as occlusion and motion blur, and make it easier to acquire motion information and optical flow. One mainstream approach utilizes temporal attention to measure the importance of each frame and discard low-quality frames. The other mainstream approach pursues mutual enhancement by utilizing self-attention or GCNs to better model the temporal relations and strengthen the dependencies between frames. For instance, Yin et al. [23] proposed a motion information-based network, which utilizes an RNN-mask network to obtain motion information and introduces a pre-trained keypoint detector to obtain four local part features. Kiran et al. [21] proposed a mutual attention network to acquire spatio-temporal video features for ReID using optical flow. Bai et al. [12] designed a salient-to-broad module to leverage the temporal relations from the perspective of difference amplification and obtained more comprehensive and informative representations. However, these methods still have some drawbacks, such as a limited ability to aggregate global-range features and a high computational cost.

2.3. Motion-Guided Segmentation

Siarohin et al. [24] presented a self-supervised deep learning method for co-part segmentation that leverages motion information to obtain human segments. Similar to previous work [25,26,27], it relies on a reconstruction objective to disentangle the object’s semantic and appearance representations. These methods heavily rely on a reconstruction model to complete the entire training stage, which also has high computational requirements. Nevertheless, the use of motion information in these works has inspired us. In particular, our work is inspired by the fact that motion information can be used to distinguish human parts and provide latent motion tokens.
In contrast to the above methods, our method can simultaneously deal with image- and video-based ReID tasks with lower computational complexity. At the same time, the motion-aware transformer with the motion consistency task enables it to obtain motion information from a still image.

3. Methodology

In this section, we introduce our proposed Motion-Aware ReID method in detail. As shown in the left part of Figure 2, it mainly consists of four modules, including a visual encoder, a visual adapter, a motion-aware module, and a fusion encoder. Here, we briefly give a general introduction to our ReID process. First, we extract the vision features from the full image context with the visual encoder module. Next, the vision adapter module is devised to integrate these features into visual tokens. Taking visual tokens as input, a motion-aware module is carefully designed and trained to further acquire motion tokens from these visual tokens. Then, we jointly merge visual tokens, motion tokens, and a hybrid class token together to feed the fusion encoder module. Finally, we utilize this hybrid class token in the ReID task head to make pedestrian identifications.

3.1. Visual Encoder and Visual Adapter

As shown in Figure 2b, the visual encoder module serves as a backbone to extract visual features. As traditional convolution-based neural networks cannot robustly extract features of the target person against background regions with diverse characteristics [7], we adopt a pre-trained ViT-B/16 as our default visual encoder.
During the training process, the images in each batch are paired so that no identity appears with only a single image. This sampling strategy is required for calculating the motion consistency loss in Section 3.2.
Similar to the standard ViT, the visual encoder module reshapes the $T$ input frames of 2D images $X \in \mathbb{R}^{T \times H \times W \times C}$ into a sequence of flattened image patches $X_p \in \mathbb{R}^{T \times N \times (P^2 \cdot C)}$. Here, $(H, W, C)$ denote the height, width, and number of channels of the original image, respectively. The sequence contains a total of $N = H \cdot W / P^2$ image patches, and the dimension of each image patch is $P^2 \cdot C$. To keep the latent vector size $D$ constant through all layers in this module, we apply a linear projection to transform these patches from $P^2 \cdot C$ dimensions to $D$ dimensions. Then, we add the positional embedding vectors to obtain a sequence of extracted visual features $F_N \in \mathbb{R}^{T \times N \times D}$.
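To make the patch-embedding step above concrete, the following minimal sketch (assuming PyTorch; tensor names, the toy sizes, and the flattening order are illustrative rather than the authors' implementation) splits T frames into non-overlapping P x P patches, projects them to D dimensions, and adds a positional embedding:

```python
# Illustrative sketch of the patch-embedding step described above (PyTorch assumed).
# Shapes follow the text: T frames of H x W x C images, patch size P, latent dim D.
import torch
import torch.nn as nn

T, H, W, C, P, D = 4, 256, 128, 3, 16, 768
N = (H // P) * (W // P)                          # N = H * W / P^2 patches per frame

patch_proj = nn.Linear(P * P * C, D)             # linear projection to D dimensions
pos_embed = nn.Parameter(torch.zeros(1, N, D))   # learnable positional embedding

x = torch.randn(T, H, W, C)                      # T input frames
# split each frame into non-overlapping P x P patches and flatten them
patches = x.unfold(1, P, P).unfold(2, P, P)      # (T, H/P, W/P, C, P, P)
patches = patches.reshape(T, N, C * P * P)       # (T, N, P^2 * C)
features = patch_proj(patches) + pos_embed       # F_N: (T, N, D)
print(features.shape)                            # torch.Size([4, 128, 768])
```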
The next module, the visual adapter, is in charge of integrating these extracted visual features $F_N$ into visual tokens $\mathrm{VT}_L \in \mathbb{R}^{L \times D}$, where $L$ denotes the number of output visual tokens. Typically, we set $L$ to be smaller than $N$ to reduce the subsequent computational complexity. For previous video ReID methods, the input features are $f \in \mathbb{R}^{T \times N \times D}$ and the computational complexity is $O(T \times N \times D)$. When we utilize the visual adapter, the complexity becomes $O(L \times D)$, where $L \ll T \times N$; thus, the complexity of our method is significantly reduced.
The key property of the visual adapter module is its capability to process both video and image inputs. By changing the number of input frames $T$, we can adapt to different types of input data (i.e., image, video, and hybrid). As shown in Figure 2b, the visual adapter module associates each input frame with the yellow “position” embeddings according to its order in the frame sequence. Then, the resulting sequence of embedding vectors serves as the input keys and values of the following multi-head self-attention block. Finally, the adapter module generates a fixed number of $L$ visual tokens, and the $L$ adapter queries are trained to learn how to integrate the visual features together [28,29].
As several images ($T > 1$) of the same person contain richer spatio-temporal information than a single image input [30], this integration process also helps retain adequate spatio-temporal information and improves the robustness of the training process [24].
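The following sketch illustrates the adapter idea under the Perceiver-style interpretation cited above [28,29]: a fixed set of L learnable queries attends over all T x N visual features, so a single image (T = 1) and a multi-frame clip both yield L visual tokens. It assumes PyTorch; the class name, head count, and initialization are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of the visual adapter: L learnable queries attend over the
# T*N visual features and return a fixed set of L visual tokens.
import torch
import torch.nn as nn

class VisualAdapter(nn.Module):
    def __init__(self, dim=768, num_tokens=128, num_heads=8, max_frames=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)
        self.frame_pos = nn.Parameter(torch.zeros(max_frames, 1, dim))  # per-frame position
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, feats):                    # feats: (T, N, D) from the visual encoder
        T, N, D = feats.shape
        kv = (feats + self.frame_pos[:T]).reshape(1, T * N, D)   # flatten all frames
        q = self.queries.unsqueeze(0)                            # (1, L, D)
        tokens, _ = self.attn(q, kv, kv)                         # queries gather visual context
        return tokens.squeeze(0)                                 # VT_L: (L, D)

adapter = VisualAdapter()
image_tokens = adapter(torch.randn(1, 128, 768))   # single image, T = 1
video_tokens = adapter(torch.randn(4, 128, 768))   # 4-frame clip
print(image_tokens.shape, video_tokens.shape)      # both torch.Size([128, 768])
```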

3.2. Motion-Aware Transformer

In Figure 2c, the motion-aware transformer module takes the aforementioned visual tokens $\mathrm{VT}_L$ from the visual adapter module to generate the corresponding motion tokens $\mathrm{MT}_L$. It consists of a standard cross-attention layer [31], a multi-head self-attention layer, and a feed-forward network layer. The cross-attention layer aims to extract foreground human body parts from $\mathrm{VT}_L$ with the learnable queries. Next, the self-attention blocks further incorporate the local context of human parts into separate part prototypes. The feed-forward network (FFN), consisting of two fully connected layers, introduces non-linearity and produces the attention output $\mathrm{MT}_L$.
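A minimal sketch of one such block is given below, assuming PyTorch: learnable motion queries first cross-attend to the visual tokens, then pass through self-attention and a two-layer FFN to produce the motion tokens. Layer sizes, normalization placement, and names are illustrative assumptions rather than the authors' exact design.

```python
# Sketch of one motion-aware transformer block: cross-attention over VT_L,
# self-attention among the part prototypes, and a two-layer FFN.
import torch
import torch.nn as nn

class MotionAwareBlock(nn.Module):
    def __init__(self, dim=768, num_queries=10, num_heads=8, ffn_dim=2048):
        super().__init__()
        self.motion_queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, ffn_dim), nn.ReLU(), nn.Linear(ffn_dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, visual_tokens):                       # VT_L: (L, D)
        vt = visual_tokens.unsqueeze(0)                     # (1, L, D)
        q = self.motion_queries.unsqueeze(0)                # (1, K, D)
        x = self.norm1(q + self.cross_attn(q, vt, vt)[0])   # extract human-part context
        x = self.norm2(x + self.self_attn(x, x, x)[0])      # relate the part prototypes
        x = self.norm3(x + self.ffn(x))                     # non-linearity
        return x.squeeze(0)                                 # MT_L: (K, D)

block = MotionAwareBlock()
motion_tokens = block(torch.randn(128, 768))
print(motion_tokens.shape)                                  # torch.Size([10, 768])
```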
To learn valid and effective queries, we design a motion consistency task and a corresponding loss function for the training process of the motion-aware transformer. To make the main steps of the motion consistency task easier to explain, we choose two images from different views as input in Figure 3. Note that the motion tokens are generated independently of the consistency task; consequently, the proposed motion-aware transformer module can be applied to a single image at inference time.
For the two images (i.e., source and target), MLP-1 utilizes the motion tokens $\mathrm{MT}_L$ to obtain the human part segmentation results $M_{S,part}$ and $M_{T,part}$. The output of MLP-1, $M_{part} \in \mathbb{R}^{N \times H_D \times W_D}$, represents probability distributions for the $N$ different human parts. Here, the number of human parts produced by the motion-aware transformer is determined by the length $N$ of the learnable queries; $H_D$ and $W_D$ denote the height and width of the segments. In Figure 3, we set $N = 10$ and obtain ten probability distributions for the ten human parts. Formally, let $M_{part}^k$ be the $k$-th ($k \in [0, \dots, N-1]$) segment of the human parts $M_{part}$. In the probability distribution $M_{part}^k$, we define the highest-probability point as the key point $p_{part}^k$. The key point $p_{part}^k = (i, j)$, with $i \in [1, H_D]$ and $j \in [1, W_D]$, represents the location associated with the $k$-th human part segment. Hence, for both the source image and the target image, we can extract ten key points of different human parts according to the segmentation results of MLP-1.
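The key-point selection step can be sketched as follows (PyTorch assumed; the function name and toy map sizes are illustrative): each of the N part-probability maps is searched for its maximum, and the (i, j) location of that maximum is returned as the part's key point.

```python
# Sketch of selecting the peak location of each part-probability map as its key point.
import torch

def extract_keypoints(part_maps):
    # part_maps: (N_parts, H_D, W_D) probability maps, e.g. outputs of MLP-1
    n, h, w = part_maps.shape
    flat_idx = part_maps.reshape(n, -1).argmax(dim=1)      # index of the peak per part
    rows, cols = flat_idx // w, flat_idx % w
    return torch.stack([rows, cols], dim=1)                # (N_parts, 2) key points (i, j)

maps = torch.rand(10, 16, 8)
maps = maps / maps.sum(dim=(1, 2), keepdim=True)           # normalize to probability maps
print(extract_keypoints(maps))                             # ten (i, j) locations
```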
We design MLP-2 to describe the motion of all points in the human part segments $M_{part}$. The output of MLP-2 represents an affine transformation [24], which is used to approximate the optical flow $F$ of every human part segment. Here, we assume that the motion of each segment follows an affine model [24]; this implies that there exist $A \in \mathbb{R}^{2 \times 2}$ and $\beta \in \mathbb{R}^2$ such that:
$\forall z \in M_{part}^k, \quad F^k(z) = A z + \beta,$ (1)
Here, $z$ is a location point within $M_{part}^k$. After the output of MLP-2 explicitly approximates the affine parameters $A$ and $\beta$ for the given source image ($S$) and target image ($T$), we can obtain $F$ with the following equation:
$\forall z \in M_{part}^k, \quad F^k(z) = p_{S,part}^k + A_S^k \left(A_T^k\right)^{-1} \left(z - p_{T,part}^k\right),$ (2)
where $p_{S,part}^k \in \mathbb{R}^2$ and $p_{T,part}^k \in \mathbb{R}^2$ are the selected key points of the source image and the target image, respectively, and $A_S^k$ and $A_T^k$ are the predicted motion descriptions from MLP-2. In other words, the optical flow $F$ can be approximated by an affine transformation corresponding to each segmented part.
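A small sketch of Equation (2), assuming PyTorch and toy values for the key points and affine matrices, evaluates the per-part flow on a regular grid of locations z:

```python
# Sketch of Equation (2): F^k(z) = p_S^k + A_S^k (A_T^k)^-1 (z - p_T^k) on a grid.
import torch

def part_flow(p_s, p_t, A_s, A_t, h, w):
    # p_s, p_t: (2,) key points; A_s, A_t: (2, 2) affine matrices
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32), indexing="ij")
    z = torch.stack([ys, xs], dim=-1).reshape(-1, 2)        # all grid locations (h*w, 2)
    M = A_s @ torch.linalg.inv(A_t)                         # A_S^k (A_T^k)^-1
    flow = p_s + (z - p_t) @ M.T                            # F^k(z) for every z
    return flow.reshape(h, w, 2)

F_k = part_flow(torch.tensor([3.0, 2.0]), torch.tensor([5.0, 4.0]),
                torch.eye(2) * 1.1, torch.eye(2), h=16, w=8)
print(F_k.shape)                                            # torch.Size([16, 8, 2])
```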
Given the source image and the calculated optical flow field $F^k$ of each segment $M_{part}^k$, we can approximate each human part segment in the target image by:
$\hat{M}_{T,part}^k = M_{S,part}^k \otimes F^k,$ (3)
where $\otimes$ denotes the element-wise product. Taking the shoulder of the same person in Figure 3 as an example, the segmentation map corresponding to the shoulder in the source image is multiplied with the corresponding $F^k$ to obtain the prediction $\hat{M}_{T,part}^k$ of the shoulder’s segmentation map after the motion. Our goal is to make this prediction similar to the segmentation map $M_{T,part}^k$ corresponding to the shoulder in the target image.
A popular method to compare the similarity of two probability distributions is the KL-divergence. The two distributions should be as consistent as possible, so a KL loss is used here to enforce motion consistency. Finally, the motion consistency loss is calculated by:
$L_{mc} = \sum_{k} \hat{M}_{T}^{k} \log \frac{\hat{M}_{T}^{k}}{M_{T}^{k}} + L_{eq},$ (4)
The first summation term in the equation is the KL loss, and $L_{eq}$ represents the equivariance constraint loss [27]. The equivariance constraint loss is calculated with thin-plate spline deformations, which have been widely used in unsupervised key point detection [25,32] to ensure the robustness and stability of the training process. We adopt $L_{eq}$ mainly to stabilize our training process and keep the human part segmentation discriminative. The motion consistency loss $L_{mc}$ constantly optimizes the learnable queries. Eventually, the implicit motion information, the human part-segment information, and their relationships are integrated into the learnable queries, and we obtain more accurate motion tokens from the motion-aware transformer module.
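The KL part of Equation (4) can be sketched as below (PyTorch assumed; the equivariance term $L_{eq}$ is omitted, and the clamping constant is an illustrative numerical safeguard):

```python
# Sketch of the KL term of Equation (4): the warped source segmentation \hat{M}_T^k
# should match the target segmentation M_T^k.
import torch

def motion_consistency_kl(pred_parts, target_parts, eps=1e-8):
    # pred_parts, target_parts: (N_parts, H_D, W_D) probability maps
    p = pred_parts.clamp_min(eps)
    q = target_parts.clamp_min(eps)
    return (p * (p / q).log()).sum()              # sum over parts and locations of p*log(p/q)

pred = torch.softmax(torch.randn(10, 16 * 8), dim=1).reshape(10, 16, 8)
target = torch.softmax(torch.randn(10, 16 * 8), dim=1).reshape(10, 16, 8)
print(motion_consistency_kl(pred, target))        # scalar loss value
```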

3.3. Fusion Encoder

The fusion encoder module mainly outputs an additional token $h_{m\_cls}$ for the final re-identification task. We apply learnable linear projections to the visual tokens $\mathrm{VT}_L$ and the motion tokens $\mathrm{MT}_L$. Then, we concatenate them with an additional token $MCLS$ as the input of the fusion encoder transformer. The additional token $MCLS$ allows cross-attention between the projected vision and motion representations and fuses the visual tokens with the motion tokens. For retrieval tasks, the final hidden-state output $h_{m\_cls}$ is used as the final human feature representation.
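A minimal sketch of the fusion encoder, assuming PyTorch and illustrative layer sizes, projects the two token sets, prepends the MCLS token, and takes the first output of a standard transformer encoder as $h_{m\_cls}$:

```python
# Sketch of the fusion encoder: project VT_L and MT_L, prepend MCLS, encode, take token 0.
import torch
import torch.nn as nn

class FusionEncoder(nn.Module):
    def __init__(self, dim=768, num_layers=6, num_heads=8):
        super().__init__()
        self.proj_v = nn.Linear(dim, dim)
        self.proj_m = nn.Linear(dim, dim)
        self.mcls = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, visual_tokens, motion_tokens):        # (L, D), (K, D)
        tokens = torch.cat([self.mcls,
                            self.proj_v(visual_tokens).unsqueeze(0),
                            self.proj_m(motion_tokens).unsqueeze(0)], dim=1)
        out = self.encoder(tokens)                           # (1, 1 + L + K, D)
        return out[:, 0]                                     # h_m_cls: (1, D)

fusion = FusionEncoder()
h_m_cls = fusion(torch.randn(128, 768), torch.randn(10, 768))
print(h_m_cls.shape)                                         # torch.Size([1, 768])
```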

3.4. Training and Inference

In the training process, we first pre-train the motion-aware transformer module on video datasets in order to obtain stable and effective queries in the MAT module. The visual adapter module enables the training process to support both video and image datasets. Then, we carry out the normal training process on the benchmark datasets used for comparison. Our proposed method is trained in an end-to-end manner. The objective function consists of the following two parts:
$L = \lambda_g L_g + \lambda_{mc} L_{mc},$ (5)
where $\lambda_g$ and $\lambda_{mc}$ are scaling factors. For the final ReID tasks, we calculate the cross-entropy loss and the triplet loss [33] for identification with the ground truth as follows:
$L_g = L_c(h_{m\_cls}) + L_t(h_{m\_cls}),$ (6)
where $h_{m\_cls}$ is the output of the fusion encoder, $L_c$ represents the cross-entropy loss, and $L_t$ represents the triplet loss.
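The overall objective in Equations (5) and (6) can be sketched as follows, assuming standard PyTorch losses and a toy batch; the classifier head, triplet margin, and placeholder $L_{mc}$ value are illustrative assumptions rather than the authors' exact settings:

```python
# Sketch of L = lambda_g * L_g + lambda_mc * L_mc with L_g = cross-entropy + triplet loss.
import torch
import torch.nn as nn

num_ids, dim = 751, 768
classifier = nn.Linear(dim, num_ids)                # identity classification head (assumed)
ce_loss = nn.CrossEntropyLoss()
triplet_loss = nn.TripletMarginLoss(margin=0.3)     # margin value is illustrative
lambda_g, lambda_mc = 1.0, 0.5

h_m_cls = torch.randn(8, dim)                       # a batch of fused features
labels = torch.randint(0, num_ids, (8,))
anchor, positive, negative = h_m_cls[0:1], h_m_cls[1:2], h_m_cls[2:3]  # toy triplet

L_g = ce_loss(classifier(h_m_cls), labels) + triplet_loss(anchor, positive, negative)
L_mc = torch.tensor(0.1)                            # placeholder for the KL + L_eq term
L_total = lambda_g * L_g + lambda_mc * L_mc
print(L_total)
```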
In the inference stage, we only use the $h_{m\_cls}$ token from the last layer of the fusion encoder as the representative information of each image for the subsequent retrieval tasks.
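The retrieval step can be sketched as below (PyTorch assumed; feature sizes and the toy gallery are illustrative): clip-level $h_{m\_cls}$ features are averaged into one descriptor and matched against the gallery by cosine similarity.

```python
# Sketch of retrieval: average clip-level h_m_cls features and rank by cosine similarity.
import torch
import torch.nn.functional as F

clip_feats = torch.randn(6, 768)                    # h_m_cls of six 4-frame clips
query = F.normalize(clip_feats.mean(dim=0, keepdim=True), dim=1)   # (1, 768)

gallery = F.normalize(torch.randn(100, 768), dim=1)                # 100 gallery features
scores = query @ gallery.T                          # cosine similarity, shape (1, 100)
ranking = scores.argsort(dim=1, descending=True)    # best matches first
print(ranking[0, :5])                               # top-5 gallery indices
```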

4. Experiments

In this section, quantitative and qualitative experiments are presented to demonstrate the effectiveness of the proposed network. We first introduce the implementation details of our experiments. Then, we conduct quantitative experiments on both image- and video-based datasets, including Market-1501 [34], Partial-iLIDS [35], Partial REID [36], Occluded REID [37], MARS [38], LS-VID [39], iLiDS-VID [40], and PRID-2011 [41]. Finally, adequate ablation studies are performed to prove the effectiveness of each module.

4.1. Datasets and Evaluation Metrics

Market-1501 [34] contains 12,936 training images of 751 persons, and 3368 query images and 19,732 gallery images of 750 persons captured by 6 cameras. It is a holistic dataset.
Partial-iLIDS [35] contains a total of 238 images from 119 people captured by multiple cameras, and their occluded regions are manually cropped.
Partial REID [36] is a specially designed partial person ReID benchmark. It involves 600 images from 60 people. We take the occluded query set and the holistic gallery set for the experiments.
Occluded REID [37] contains 2000 images belonging to 200 identities. Each identity has five full-body person images and five occluded person images with different viewpoints and different types of severe occlusions.
MARS [38] was collected by 6 near-synchronized cameras. It contains 1261 different pedestrians, each captured by at least 2 cameras.
LS-VID [39] utilizes a 15-camera network and selects 4 days for data recording. It contains 14,943 sequences of 3772 pedestrians, and the average sequence length is 200 frames.
iLiDS-VID [40] is extracted from the iLIDS MCTS dataset and contains 600 videos of 300 identities. Due to the nature of the iLIDS MCTS dataset, occlusion in iLiDS-VID is very severe.
PRID-2011 [41] has 385 videos from camera A and 749 videos from camera B, where only 200 people appear in both cameras at the same time.
Evaluation metrics. We adopt Cumulative Matching Characteristic (CMC) curves and mean average precision (mAP) to evaluate the quality of different Re-ID tasks.
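For reference, the following simplified sketch (PyTorch assumed) shows how a CMC curve and an average precision value are obtained for one query from the ranked list of gallery matches; the standard benchmark protocols additionally filter same-camera and junk images, which is omitted here.

```python
# Sketch of CMC and average precision for a single query, given the ranked match list.
import torch

def cmc_and_ap(matches, max_rank=10):
    matches = torch.tensor(matches, dtype=torch.float32)    # 1 = correct match, ranked order
    hits = torch.cumsum(matches, dim=0)
    cmc = (hits > 0).float()[:max_rank]                     # is there a hit within top-k?
    precision = hits / torch.arange(1, len(matches) + 1)
    ap = (precision * matches).sum() / matches.sum().clamp_min(1)
    return cmc, ap

cmc, ap = cmc_and_ap([0, 1, 0, 0, 1, 0, 0, 0, 0, 0])
print(cmc[0].item(), cmc[4].item(), round(ap.item(), 3))    # Rank-1 = 0.0, Rank-5 = 1.0, AP = 0.45
```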

4.2. Implementation Details

Our model training is divided into two stages. In the first stage, we pre-train the model using the training sets of all the image and video datasets mentioned above, where an image is considered as a video with $T = 1$. Due to the presence of the visual adapter, our network can accept inputs from both images and videos, and the training samples are drawn randomly from the images and videos. In the second stage, fine-tuning is performed on each dataset using its own training set.
Images and video frames are all resized to $256 \times 128$. The patch size is set to 16. For video data, every batch has 32 clips, which correspond to 8 identities. The numbers of layers of the motion-aware transformer and the fusion encoder are both set to 6. For image data, every batch contains 8 identities, each including 4 different perspectives. The network is trained for 120 epochs and optimized by the Adam optimizer with a weight decay of 0.005. We also use random flipping and random erasing with a probability of 0.5 for data augmentation. The number of learnable queries in the motion-aware transformer is set to 10. $\lambda_g$ and $\lambda_{mc}$ in Equation (5) are set to 1 and 0.5, respectively. In the test stage, we use all frames in units of 4-frame clips, obtain the final video feature by averaging all the corresponding $h_{m\_cls}$ tokens, and use the cosine similarity for retrieval. Additionally, the motion consistency task is only activated during the training stage. For a single image, we directly assign the input image as the source image, and the target image is randomly selected from the different perspectives of the same ID in the same batch. For video frames (e.g., a 4-frame video clip), we assign the first frame as the source image, and the target image is randomly selected from the three remaining frames of the same video.
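A small sketch of the data augmentation and optimizer settings listed above is given below, assuming torchvision transforms; the learning rate and the stand-in model are illustrative assumptions, since the text does not specify them.

```python
# Sketch of the augmentation pipeline (256 x 128 resize, random flip and random erasing
# with p = 0.5) and the Adam optimizer with weight decay 0.005, as stated in the text.
import torch
import torchvision.transforms as transforms

train_transform = transforms.Compose([
    transforms.Resize((256, 128)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.5),
])

model = torch.nn.Linear(768, 751)                  # stand-in for the full MAF network
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4, weight_decay=0.005)  # lr assumed
```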

4.3. Comparison with State-of-the-Art Methods

Comparisons on Video-based Datasets. On account of the design of the visual adapter, our method could deal with video-based ReID tasks. As shown in Table 2, when comparing our approach with state-of-the-art methods, we have achieved comparable performance. Especially on the MARS and LS-VID datasets, we have achieved the highest Rank-1 and mAP. The results of the video-based methods have demonstrated that our visual adapter enables us to distill the spatio-temporal features and the dependencies among video frame features.
Comparisons on Holistic Datasets. The results on Market-1501 are shown in Table 3. It is clear that our method achieves the best performance compared to other state-of-the-art part-based and global-based methods. This demonstrates that our approach obtains more representative features for holistic pedestrians.
Comparisons on Occlusion Datasets. The results on the Occluded-REID are shown in Table 4 and the results on the Partial datasets are shown in Table 5. When comparing our method with state-of-the-art methods, we achieve the highest Rank-1 and mAP on both Occluded-REID and Partial datasets. Especially on the partial datasets, we achieve 89.2%/93.2% and 77.5%/89.6% on Rank-1/Rank-3 for Partial-REID and Partial-iLIDS, respectively.
As is well known, occluded ReID tasks always need more refined part features to perform pedestrian retrieval. Through these experiments, our method has demonstrated a strong ability to handle occlusions. The reason can be summarized in two ways: on the one hand, the motion consistency task guides the model to extract human part features and aggregate the motion information using learnable queries; on the other hand, the fusion encoder combines the motion information with the original vision features, which helps the model generate discriminative representations.

4.4. Ablation Studies

Experiments on the motion consistency task. When we remove the motion consistency task from the motion-aware transformer, the learnable queries in the motion-aware transformer are used directly to extract human part features. As can be seen from rows #1, #2, #7, and #8 in Table 6, the variants without the motion consistency task perform poorly on occluded ReID tasks and yield similar results on holistic ReID tasks. It is evident that, without the motion consistency task, our method pays more attention to modeling the pedestrian appearance features, which are greatly influenced by any occlusion.
Additionally, unlike previous motion-based methods [21,23] that directly used motion information, such as [24], in an explicit way, our method first obtains implicit motion information through a motion-aware transformer and a motion consistency task. Then, a fusion encoder is utilized to fuse the visual features and the implicit motion tokens in an explicit way. Here, we have demonstrated that the introduction of motion information is the main factor in dealing with occluded situations.
Experiments on the visual adapter and motion-aware transformer. To assess the importance of the visual adapter and the motion-aware transformer, we add two ablation experiments and summarize the results in Table 6. As shown in Table 6 #3, our method can only handle video-based ReID tasks when the visual adapter is introduced. As shown in Table 6 #6 and #8, removing the VA affects the accuracy of the video-based task (MARS) (1.3% drop) more than the image-based tasks. This is because the learnable latent queries of the VA module can aggregate and learn spatio-temporal information from multiple video frames, while this has no effect on still images. The MAT module utilizes motion information to focus more on human part segments. As shown in Table 6 #5 and #8, the absence of the MAT largely reduces the accuracy on O-REID (3.2% drop) and P-REID (6.1% drop), as the attention on human-part segmentation is affected by background noise.
Experiments on the length of learnable queries in the visual adapter. We conduct supplementary experiments (Table 7) on different latent query lengths in the visual adapter module. In contrast to previous methods [12], the query length can be adjusted to fit different datasets. If we increase the query length to 256, the accuracy on the video tasks surpasses that of [12]. However, if the query length is too long (512), the model overfits, which reduces its accuracy.
Experiments on the length of learnable queries in the motion-aware transformer. We change the length of the learnable queries in the motion-aware transformer since the length of the queries corresponds to the degree of refinement of human parts. In other words, the length of the queries determines how many parts the human body is split into. We evaluate different query lengths on Market-1501, Occluded-REID, Partial-REID, and MARS. As shown in Table 8, the ablation experiments show the same trend: as the length increases, the Rank-1 accuracy first increases and then decreases, peaking at a length of 10. On the occluded and partial datasets, when the length is set greater than 10, our method experiences a sudden decline in Rank-1, which shows that it is sensitive to the length of the learnable queries. When the model is forced to focus on more human parts, many unnecessary distinctions arise, which directly increases the difficulty of modeling the human body. On the contrary, on the holistic ReID task, the Rank-1 variation is not particularly dramatic. This phenomenon indicates two things. Firstly, occluded scenarios rely more on refined human part features, whereas holistic scenarios rely more on global features; a suitable query length helps the model learn the body parts in sufficient detail. Secondly, the fusion encoder is able to successfully aggregate local and global features into the hidden state $h_{m\_cls}$.
Reviewing previous human part-based methods, their part-aware masks may not suppress disturbances or further group human part features well, which may be the main reason why some part-aware masks retain high confidence scores in background areas. It is worth mentioning that these designs achieve strong performance but still focus mainly on human appearance features, which limits them in occluded situations. In contrast to these methods, by introducing motion information from still images via the motion consistency task, we focus not only on the pedestrian appearance features but also on their representative motion information, which makes our model more robust.
Experiments on pre-training on video datasets. As shown in Table 9, if we do not utilize the video datasets to pre-train the motion-aware transformer, the performance of our method is affected to some extent: it brings about a 0.6∼0.7% decrease in Rank-1 and an approximately 0.5% decrease in mAP. A very important part of our work is how to extract the motion information of human body parts from static images. The motivation for pre-training the motion-aware transformer is to give the model some prior knowledge of motion information for the subsequent training on static images, which makes the training process smoother and more stable. Note that the visual adapter is the main factor that helps us acquire this motion prior knowledge from the video data. However, this does not mean that our model cannot be trained from scratch: without the pre-training on video datasets, our model is still able to achieve competitive performance; it simply requires more training epochs.

4.5. Visualization

Overview. Our proposed ReID framework is based on motion information; after sufficient training, the network focuses its attention on pedestrians, as shown in Figure 4a. Figure 4b also shows the effect of the motion consistency task that we introduce, whose main purpose is to learn the features of each part of the human body. This process is similar to semantic segmentation, and occlusions such as cars and license plates are treated as background noise by the motion consistency task. As shown in Figure 5a, the image-based and video-based attention maps from the visual adapter module sense the approximate range of the human body and suppress the background noise. As shown in Figure 5b, the attention map of the latent token introduced in the motion consistency task shows that it pays more attention to the detailed features of each part of the human body, while also suppressing background noise to a certain extent. Therefore, the method proposed in this paper can effectively reduce the background noise caused by objects that may change their positions, such as cars and license plates.
The visualization of human part segmentation. We visualize the human part segmentation from the motion consistency task in Figure 4. In Figure 4a, we can intuitively see that after adding motion information, the attention map is more focused on the human body and is highly resistant to occlusions. In Figure 4b, when the pedestrian rides a bike in various postures or is occluded by a car, the model is still able to effectively distinguish between the human body and the occluding objects. It is worth noting that all of these improvements are made possible by introducing motion information.
The visualization of the attention map and the CMC curve. Taking one still image or several video frames as input, the visual adapter module first extracts coarse global features (Figure 5a) of the approximate motion area. Then, the MAT module further learns the refined motion information of each human body part from the coarse global features and provides more distinguishable human part features (Figure 5b). As shown in Figure 6, introducing the motion information reduces the sensitivity to human appearance. Finally, as shown in Figure 7, we provide the CMC curve for every dataset.

4.6. Limitations

Our method is sensitive to the length of the learnable queries in the motion-aware transformer, as shown in Section 4.4. Since our approach processes video input through a visual adapter, it entails a certain degree of information loss, and there is still room for improvement in the extraction of spatio-temporal information. Additionally, achieving the best performance requires pre-training the motion-aware transformer on video datasets first.

5. Conclusions and Future Work

In this paper, we rethink motion information for person re-identification and propose a motion-aware fusion network. In contrast to previous methods, on the one hand, our method is able to simultaneously deal with image data and video data by introducing a visual adapter; on the other hand, it obtains implicit motion information not only from video data but also from still image data. Moreover, the implicit motion information is fed to a fusion encoder to deeply model the relationship between vision features and the corresponding motion information. As a result, our method achieves new state-of-the-art results on both holistic and occluded ReID datasets, and it shows competitive performance on video-based datasets. In the future, considering the limitations of our proposed method, we aim to develop a novel way to utilize spatio-temporal information effectively without the pre-training stage of the motion-aware transformer module.

Author Contributions

Conceptualization, H.L. and X.C.; methodology, H.L.; software, H.L.; validation, H.L. and X.C.; formal analysis, H.L.; investigation, X.C.; resources, X.C.; data curation, X.C.; writing—original draft preparation, H.L.; writing—review and editing, H.L.; visualization, X.C.; supervision, H.L.; project administration, X.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Yang, Y.; Yang, J.; Yan, J.; Liao, S.; Yi, D.; Li, S.Z. Salient color names for person re-identification. In Proceedings of the ECCV, Zurich, Switzerland, 6–12 September 2014. [Google Scholar]
  2. Liao, S.; Hu, Y.; Zhu, X.; Li, S.Z. Person re-identification by local maximal occurrence representation and metric learning. In Proceedings of the CVPR, Boston, MA, USA, 7–12 June 2015. [Google Scholar]
  3. Zheng, W.S.; Gong, S.; Xiang, T. Reidentification by relative distance comparison. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 3, 653–668. [Google Scholar]
  4. Zhang, T.; Xu, C.; Yang, M.H. Robust structural sparse tracking. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 473–486. [Google Scholar] [CrossRef] [PubMed]
  5. Zhang, T.; Xu, C.; Yang, M.H. Learning multi-task correlation particle filters for visual tracking. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 365–378. [Google Scholar] [CrossRef] [PubMed]
  6. Tan, L.; Dai, P.; Ji, R.; Wu, Y. Dynamic Prototype Mask for Occluded Person Re-Identification. In Proceedings of the ACM MM, Lisboa, Portugal, 10–14 October 2022. [Google Scholar]
  7. Li, Y.; He, J.; Zhang, T.; Liu, X.; Zhang, Y.; Wu, F. Diverse Part Discovery: Occluded Person Re-Identification with Part-Aware Transformer. In Proceedings of the CVPR, Nashville, TN, USA, 20–25 June 2021. [Google Scholar]
  8. Wang, Z.; Zhu, F.; Tang, S.; Zhao, R.; He, L.; Song, J. Feature Erasing and Diffusion Network for Occluded Person Re-Identification. In Proceedings of the CVPR, New Orleans, LA, USA, 19–23 June 2022. [Google Scholar]
  9. Li, J.; Zhang, S.; Huang, T. Multi-scale 3d convolution network for video based person re-identification. In Proceedings of the AAAI, Honolulu, HI, USA, 27 January 2019. [Google Scholar]
  10. Gu, X.; Chang, H.; Ma, B.; Zhang, H.; Chen, X. Appearance-preserving 3d convolution for video-based person re-identification. In Proceedings of the ECCV, Glasgow, UK, 23–28 August 2020. [Google Scholar]
  11. Yan, Y.; Qin, J.; Chen, J.; Liu, L.; Zhu, F.; Tai, Y.; Shao, L. Learning multi-granular hypergraphs for video-based person re-identification. In Proceedings of the CVPR, Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  12. Bai, S.; Ma, B.; Chang, H.; Huang, R.; Chen, X. Salient-to-Broad Transition for Video Person Re-Identification. In Proceedings of the CVPR, New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
  13. Wu, J.; He, L.; Liu, W.; Yang, Y.; Lei, Z.; Mei, T.; Li, S.Z. CAViT: Contextual Alignment Vision Transformer for Video Object Re-identification. In Proceedings of the ECCV, Tel Aviv, Israel, 23–27 October 2022. [Google Scholar]
  14. Eom, C.; Lee, G.; Lee, J.; Ham, B. Video-based person re-identification with spatial and temporal memory networks. In Proceedings of the ICCV, Montreal, QC, Canada, 10–17 October 2021. [Google Scholar]
  15. He, L.; Liu, W. Guided saliency feature learning for person re-identification in crowded scenes. In Proceedings of the ECCV, Glasgow, UK, 23–28 August 2020. [Google Scholar]
  16. Gao, S.; Yu, C.; Zhang, P.; Lu, H. Ped-Mix: Mix Pedestrians for Occluded Person Re-identification. In Proceedings of the Chinese Conference on Pattern Recognition and Computer Vision (PRCV), Shenzhen, China, 14–17 October 2022; Springer: Berlin/Heidelberg, Germany, 2023; pp. 265–277. [Google Scholar]
  17. Li, J.; Wu, W.; Zhang, D.; Fan, D.; Jiang, J.; Lu, Y.; Gao, E.; Yue, T. Multi-Pedestrian Tracking Based on KC-YOLO Detection and Identity Validity Discrimination Module. Appl. Sci. 2023, 13, 12228. [Google Scholar] [CrossRef]
  18. Ni, H.; Li, Y.; Gao, L.; Shen, H.T.; Song, J. Part-aware transformer for generalizable person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 11280–11289. [Google Scholar]
  19. Somers, V.; De Vleeschouwer, C.; Alahi, A. Body part-based representation learning for occluded person Re-Identification. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 1613–1623. [Google Scholar]
  20. Miao, J.; Wu, Y.; Yang, Y. Identifying visible parts via pose estimation for occluded person re-identification. IEEE Trans. Neural Networks Learn. Syst. 2021, 33, 4624–4634. [Google Scholar] [CrossRef] [PubMed]
  21. Kiran, M.; Bhuiyan, A.; Blais-Morin, L.A.; Ayed, I.B.; Granger, E. Flow guided mutual attention for person re-identification. Image Vis. Comput. 2021, 113, 104246. [Google Scholar] [CrossRef]
  22. Davila, D.; Du, D.; Lewis, B.; Funk, C.; Van Pelt, J.; Collins, R.; Corona, K.; Brown, M.; McCloskey, S.; Hoogs, A.; et al. MEVID: Multi-view Extended Videos with Identities for Video Person Re-Identification. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 1634–1643. [Google Scholar]
  23. Yin, J.; Wu, A.; Zheng, W.S. Fine-grained person re-identification. Int. J. Comput. Vis. 2020, 128, 1654–1672. [Google Scholar] [CrossRef]
  24. Siarohin, A.; Roy, S.; Lathuilière, S.; Tulyakov, S.; Ricci, E.; Sebe, N. Motion-supervised Co-Part Segmentation. In Proceedings of the ICPR, Virtual Event, 10–15 January 2021. [Google Scholar]
  25. Jakab, T.; Gupta, A.; Bilen, H.; Vedaldi, A. Unsupervised learning of object landmarks through conditional image generation. Adv. Neural Inf. Process. Syst. 2018, 31. [Google Scholar]
  26. Zheng, L.; Huang, Y.; Lu, H.; Yang, Y. Pose-invariant embedding for deep person re-identification. IEEE Trans. Image Process. 2019, 28, 4500–4509. [Google Scholar] [CrossRef] [PubMed]
  27. Siarohin, A.; Lathuilière, S.; Tulyakov, S.; Ricci, E.; Sebe, N. First order motion model for image animation. Adv. Neural Inf. Process. Syst. 2019, 32. [Google Scholar]
  28. Jaegle, A.; Gimeno, F.; Brock, A.; Vinyals, O.; Zisserman, A.; Carreira, J. Perceiver: General perception with iterative attention. In Proceedings of the MLR, Virtual, 18–24 July 2021. [Google Scholar]
  29. Alayrac, J.B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M.; et al. Flamingo: A visual language model for few-shot learning. In Proceedings of the CVPR, New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
  30. Arnab, A.; Dehghani, M.; Heigold, G.; Sun, C.; Lučić, M.; Schmid, C. Vivit: A video vision transformer. In Proceedings of the ICCV, Montreal, QC, Canada, 10–17 October 2021. [Google Scholar]
  31. Lin, H.; Cheng, X.; Wu, X.; Shen, D. Cat: Cross attention in vision transformer. In Proceedings of the ICME, Taipei, Taiwan, 18–22 July 2022. [Google Scholar]
  32. Zhang, Y.; Guo, Y.; Jin, Y.; Luo, Y.; He, Z.; Lee, H. Unsupervised discovery of object landmarks as structural representations. In Proceedings of the CVPR, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  33. Schroff, F.; Kalenichenko, D.; Philbin, J. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the CVPR, Boston, MA, USA, 7–12 June 2015. [Google Scholar]
  34. Zheng, L.; Shen, L.; Tian, L.; Wang, S.; Wang, J.; Tian, Q. Scalable person re-identification: A benchmark. In Proceedings of the ICCV, Santiago, Chile, 7–13 December 2015. [Google Scholar]
  35. Zheng, W.S.; Gong, S.; Xiang, T. Person re-identification by probabilistic relative distance comparison. In Proceedings of the CVPR, Providence, RI, USA, 20–25 June 2011. [Google Scholar]
  36. Zheng, W.S.; Li, X.; Xiang, T.; Liao, S.; Lai, J.; Gong, S. Partial person re-identification. In Proceedings of the ICCV, Santiago, Chile, 7–13 December 2015. [Google Scholar]
  37. Zhuo, J.; Chen, Z.; Lai, J.; Wang, G. Occluded person re-identification. In Proceedings of the ICME, San Diego, CA, USA, 23–27 July 2018. [Google Scholar]
  38. Zheng, L.; Bie, Z.; Sun, Y.; Wang, J.; Su, C.; Wang, S.; Tian, Q. Mars: A video benchmark for large-scale person re-identification. In Proceedings of the ECCV, Amsterdam, The Netherlands, 11–14 October 2016. [Google Scholar]
  39. Li, J.; Wang, J.; Tian, Q.; Gao, W.; Zhang, S. Global-local temporal representations for video person re-identification. In Proceedings of the ICCV, Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  40. Wang, T.; Gong, S.; Zhu, X.; Wang, S. Person re-identification by video ranking. In Proceedings of the ECCV, Zurich, Switzerland, 6–12 September 2014. [Google Scholar]
  41. Hirzer, M.; Beleznai, C.; Roth, P.M.; Bischof, H. Person re-identification by descriptive and discriminative classification. In Proceedings of the SCIA, Ystad, Sweden, 1 May 2011. [Google Scholar]
  42. Liu, X.; Zhang, P.; Yu, C.; Lu, H.; Yang, X. Watching you: Global-guided reciprocal learning for video-based person re-identification. In Proceedings of the CVPR, Nashville, TN, USA, 20–25 June 2021. [Google Scholar]
  43. Zhang, Z.; Lan, C.; Zeng, W.; Chen, Z. Multi-granularity reference-aided attentive feature aggregation for video-based person re-identification. In Proceedings of the CVPR, Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  44. He, L.; Wang, Y.; Liu, W.; Zhao, H.; Sun, Z.; Feng, J. Foreground-aware pyramid reconstruction for alignment-free occluded person re-identification. In Proceedings of the ICCV, Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  45. Sun, Y.; Zheng, L.; Yang, Y.; Tian, Q.; Wang, S. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In Proceedings of the ECCV, Munich, Germany, 8–14 September 2018. [Google Scholar]
  46. Miao, J.; Wu, Y.; Liu, P.; Ding, Y.; Yang, Y. Pose-guided feature alignment for occluded person re-identification. In Proceedings of the ICCV, Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  47. Sun, Y.; Xu, Q.; Li, Y.; Zhang, C.; Li, Y.; Wang, S.; Sun, J. Perceive where to focus: Learning visibility-aware part-level features for partial person re-identification. In Proceedings of the CVPR, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  48. Wang, G.; Yang, S.; Liu, H.; Wang, Z.; Yang, Y.; Wang, S.; Yu, G.; Zhou, E.; Sun, J. High-order information matters: Learning relation and topology for occluded person re-identification. In Proceedings of the CVPR, Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  49. Zhu, K.; Guo, H.; Liu, Z.; Tang, M.; Wang, J. Identity-guided human semantic parsing for person re-identification. In Proceedings of the ECCV, Glasgow, UK, 23–28 August 2020. [Google Scholar]
  50. He, S.; Luo, H.; Wang, P.; Wang, F.; Li, H.; Jiang, W. Transreid: Transformer-based object re-identification. In Proceedings of the CVPR, Nashville, TN, USA, 20–25 June 2021. [Google Scholar]
  51. Gao, S.; Wang, J.; Lu, H.; Liu, Z. Pose-guided visible part matching for occluded person ReID. In Proceedings of the CVPR, Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
Figure 1. The MAF model is capable of handling ReID tasks based on both image and video inputs and achieves competitive results on both image- and video-based ReID tasks.
Figure 2. An overview of the proposed motion-aware fusion network for person re-identification. In (a), the overall network consists of two branches, one for extracting visual features from video and image inputs and the other for mining implicit motion information. In (b), we show the detail of our visual adapter. In (c), we show the detail of the motion-aware transformer (MAT) module.
Figure 3. The explanation of the motion consistency task. For a more concise presentation of the motion consistency task, we omitted two input images from different views here.
Figure 4. (a) We show the attention map of MAF, and it can be found that the attention of MAF is more focused on the human body region part and less affected by the background noise. (b) We show the intermediate results of the segmentation branch in the motion consistency task, and it can be found that the motion consistency task we introduced can help the MAF better localize more parts of the human body, and thus obtain more fine-grained local features of the human body.
Figure 5. (a) The visualization of the attention map in the visual adapter module shows that the visual adapter can roughly localize the approximate area of the human body. (b) The visualization of the latent token in the motion consistency task shows that, on top of the visual adapter, the latent token refines the human body by focusing on different human body parts separately, thus providing the local features of the human body.
Figure 6. The visualization demonstrates the difference in retrieval with and without the introduction of motion information; introducing motion information effectively alleviates the problem of similarly dressed pedestrians.
Figure 7. The visualization of the CMC curve of MAF.
Table 1. A brief summary of classic ReID methods.
Method | Reference | Classification | Evaluation Metrics | Main Findings
PAT [7] | CVPR 2021 | Image | Rank-1, mAP | Learning human part prototypes
PMFB [19] | TNNLS 2021 | Image | Rank-1, mAP | Utilize pose estimation to localize the human body
Co-attention [20] | ICIP 2021 | Image | Rank-1, Rank-3 | Use semantic segmentation to accomplish Partial-REID
FED [8] | CVPR 2022 | Image | Rank-1, mAP | Non-pedestrian occlusion augmentation strategy
Flow [21] | IVC 2021 | Video | Rank-1, mAP | Use optical flow to obtain spatio-temporal information
SBM [12] | CVPR 2022 | Video | Rank-1, mAP | Use temporal relations to obtain human features
MEVID [22] | WACV 2023 | Video | mAP, CMC | Propose a multi-view video ReID dataset
Table 2. Performance comparison with state-of-the-art methods on MARS, LS-VID, iLiDS-VID, and PRID-2011 datasets. Our method achieves a competitive performance on four datasets.
Method | MARS Rank-1 | MARS mAP | LS-VID Rank-1 | LS-VID mAP | iLiDS-VID Rank-1 | iLiDS-VID Rank-5 | PRID-2011 Rank-1 | PRID-2011 Rank-5
M3D [9] | 84.4 | 74.1 | 57.7 | 40.1 | 74.0 | 94.3 | 94.4 | 100.0
AP3D [10] | 90.1 | 85.1 | 84.5 | 73.2 | 88.7 | - | - | -
GRL [42] | 91.0 | 84.8 | - | - | 90.4 | 98.3 | 96.2 | 99.7
GLTR [39] | 87.0 | 78.5 | 63.1 | 44.3 | 86.0 | 98.0 | 95.5 | 100.0
MGH [11] | 90.0 | 85.8 | - | - | 85.6 | 97.1 | 94.8 | 99.3
MG-RAFA [43] | 88.8 | 85.9 | - | - | 88.6 | 98.0 | 95.9 | 99.7
STMN [14] | 90.5 | 84.5 | 82.1 | 69.2 | 91.5 | - | - | -
CAViT [13] | 90.8 | 87.2 | 89.2 | 79.2 | 93.3 | 98.0 | 95.5 | 98.9
SBM [12] | 91.0 | 86.2 | 87.4 | 79.6 | 92.5 | - | 96.5 | -
Ours | 91.1 | 86.4 | 87.6 | 79.9 | 92.4 | 97.9 | 96.4 | 100.0
Table 3. Performance comparison with state-of-the-art methods on Market-1501.
Method | Market-1501 Rank-1 | Market-1501 mAP
FPR [44] | 95.4 | 86.6
PCB [45] | 92.3 | 77.4
PGFA [46] | 91.2 | 76.8
VPM [47] | 93.0 | 80.8
HOReID [48] | 94.2 | 84.9
ISP [49] | 95.3 | 88.6
PAT [7] | 95.4 | 88.0
TransReID [50] | 95.0 | 88.2
FED [8] | 95.0 | 86.3
DPM [6] | 95.5 | 89.7
Ours | 95.7 | 89.8
Table 4. Performance comparison with state-of-the-art methods on Occluded-REID dataset. Our method achieves the best performance.
Method | Occluded-REID Rank-1 | Occluded-REID mAP
FPR [44] | 78.3 | 68.0
PCB [45] | 41.3 | 38.9
AMC+SWM [36] | 31.2 | 27.3
PVPM [51] | 70.4 | 61.2
HOReID [48] | 80.3 | 70.2
PAT [7] | 81.6 | 72.1
TransReID [50] | 70.2 | 67.3
DPM [6] | 85.5 | 79.7
Ours | 86.7 | 81.2
Table 5. Performance comparison with state-of-the-art methods on Partial REID dataset, and Partial-iLIDS dataset. Our method achieves the best performance on two datasets.
Method | Partial-REID Rank-1 | Partial-REID Rank-3 | Partial-iLIDS Rank-1 | Partial-iLIDS Rank-3
FPR [44] | 81.0 | - | 68.1 | -
AMC+SWM [36] | 37.3 | 46.0 | 21.0 | 32.8
PVPM [51] | 78.3 | 87.7 | - | -
PGFA [46] | 68.0 | 80.0 | 69.1 | 80.9
VPM [47] | 67.7 | 81.9 | 65.5 | 74.8
HOReID [48] | 85.3 | 91.0 | 72.6 | 86.4
PAT [7] | 88.0 | 92.3 | 76.5 | 88.2
FED [8] | 84.6 | - | - | -
Ours | 89.2 | 93.2 | 77.5 | 89.6
Table 6. Ablations of key modules of MAF on Market-1501, Occluded-REID, Partial-REID, and MARS. Here, O-REID is the abbreviation of Occluded-REID and P-REID is the abbreviation of Partial-REID. MCT refers to the motion consistency task. VA refers to the visual adapter. MAT refers to the motion-aware transformer.
ID | MCT | VA | MAT | Market-1501 | O-REID | P-REID | MARS
#1 | | | | 92.7 | 81.2 | 81.0 | -
#2 | | | | 94.0 | 84.3 | 84.9 | -
#3 | | | | 92.9 | 81.5 | 81.2 | 89.1
#4 | | | | 94.5 | 83.2 | 83.4 | -
#5 | | | | 94.5 | 83.5 | 83.1 | 90.1
#6 | | | | 95.4 | 86.3 | 88.9 | -
#7 | | | | 94.9 | 83.9 | 83.6 | 90.5
#8 | | | | 95.7 | 86.7 | 89.2 | 91.1
Table 7. The experiments on latent query length in the visual adapter on MARS, LS-VID, iLiDS-VID, and PRID-2011. The values in parentheses denote the previous SOTA.
Length | MARS Rank-1 | LS-VID Rank-1 | iLiDS-VID Rank-1 | PRID-2011 Rank-1
128 | 91.1 | 87.6 | 92.4 | 96.4
256 | 91.2 (91.0) | 87.7 (87.4) | 92.8 (92.5) | 96.7 (96.5)
512 | 91.0 | 87.4 | 92.3 | 96.3
Table 8. The experiments about the length of learnable queries in motion-aware transformer on Market-1501, Occluded-REID, Partial-REID, and MARS. Here, O-REID is the abbreviation of Occluded-REID and P-REID is the abbreviation of Partial-REID.
Length | Market-1501 Rank-1 | O-REID Rank-1 | P-REID Rank-1 | MARS Rank-1
2 | 94.0 | 84.0 | 83.7 | 89.9
4 | 94.4 | 84.3 | 84.2 | 90.1
6 | 95.0 | 85.7 | 86.3 | 90.4
8 | 95.2 | 86.3 | 88.9 | 90.9
10 | 95.7 | 86.7 | 89.2 | 91.1
16 | 95.4 | 85.2 | 88.1 | 90.9
32 | 95.0 | 84.1 | 84.0 | 90.5
Table 9. The experiments on whether the motion-aware transformer is pre-trained on video datasets. Here, O-REID is the abbreviation of Occluded-REID and P-REID is the abbreviation of Partial-REID. The values in parentheses represent the change when pre-training is removed.
Method | O-REID Rank-1 | O-REID mAP | P-REID Rank-1 | P-REID Rank-3
w | 86.7 | 81.2 | 89.2 | 93.2
w/o | 86.1 (−0.6) | 79.8 (−0.4) | 88.5 (−0.7) | 92.7 (−0.5)