Abstract
Temporal modeling is a key problem in action recognition, yet accurately modeling the temporal information in videos remains difficult. In this paper, we present a local spatiotemporal extraction (LSTE) module and a channel-time excitation (CTE) module, both designed to model the temporal information in video sequences accurately. The LSTE module first obtains difference features by computing pixel-wise differences between adjacent frames within each video segment, and then derives local motion features by emphasizing the feature channels that are sensitive to this difference information. The local motion features are merged with the spatial features to represent the local spatiotemporal features of each segment. The CTE module adaptively excites time-sensitive channels by modeling the temporal interdependencies among channels, thereby enhancing the global temporal information. Finally, we embed these two modules into existing 2D CNN baselines to build an action recognition network based on local spatiotemporal features and global temporal excitation (LSCT). We conduct experiments on the Something-Something V1 and V2 datasets, which depend heavily on temporal reasoning, and compare the recognition results with those of current methods, demonstrating the effectiveness of our approach.
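The sketch below illustrates, in PyTorch, one plausible reading of the two modules described above. It is a minimal sketch under stated assumptions, not the paper's exact design: the squeeze-and-excitation-style channel gate in LSTE, the depthwise 1D temporal convolution in CTE, the residual merge, and all layer sizes (e.g., `reduction=16`, `kernel_size=3`) are illustrative choices introduced here for clarity.

```python
import torch
import torch.nn as nn

class LSTE(nn.Module):
    """Local spatiotemporal extraction (sketch): frame differences,
    then a channel gate that stresses difference-sensitive channels,
    merged residually with the spatial features."""
    def __init__(self, channels, reduction=16):  # reduction is an assumption
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # x: (N, T, C, H, W) -- frames of one video segment
        n, t, c, h, w = x.shape
        diff = x[:, 1:] - x[:, :-1]                # pixel-wise differences between adjacent frames
        diff = torch.cat([diff, diff[:, -1:]], 1)  # repeat last difference to keep T steps
        # channel gate computed from spatially pooled difference features
        g = self.pool(diff.reshape(n * t, c, h, w)).flatten(1)
        g = self.fc(g).reshape(n, t, c, 1, 1)
        motion = diff * g                          # stress difference-sensitive channels
        return x + motion                          # merge local motion with spatial features


class CTE(nn.Module):
    """Channel-time excitation (sketch): models per-channel temporal
    interdependencies with a depthwise 1D convolution and excites
    time-sensitive channels."""
    def __init__(self, channels, kernel_size=3):  # kernel size is an assumption
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=kernel_size // 2, groups=channels)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # x: (N, T, C, H, W)
        n, t, c, h, w = x.shape
        s = x.mean(dim=(3, 4))                     # spatial squeeze -> (N, T, C)
        s = self.conv(s.transpose(1, 2))           # temporal conv per channel -> (N, C, T)
        g = self.sigmoid(s).transpose(1, 2).reshape(n, t, c, 1, 1)
        return x * g                               # excite time-sensitive channels
```

In this reading, both modules are drop-in blocks that preserve the input shape, which is consistent with embedding them into an existing 2D CNN backbone as the abstract describes; how and where the paper inserts them into the baseline is not specified here.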