Article

Forward Hand Gesture Spotting and Prediction Using HMM-DNN Model

1 Computer Science Department, Faculty of Science, Tanta University, Tanta 31527, Egypt
2 College of Computer Science and Engineering, Taibah University, Yanbu 966144, Saudi Arabia
3 Mathematics & Computer Science Department, Faculty of Science, Menoufiya University, Menoufia 32511, Egypt
* Author to whom correspondence should be addressed.
Informatics 2023, 10(1), 1; https://doi.org/10.3390/informatics10010001
Submission received: 12 October 2022 / Revised: 17 December 2022 / Accepted: 22 December 2022 / Published: 28 December 2022

Abstract

Automatic key gesture detection and recognition are difficult tasks in Human–Computer Interaction due to the need to spot the start and the end points of the gesture of interest. By integrating Hidden Markov Models (HMMs) and Deep Neural Networks (DNNs), the present research provides an autonomous technique that carries out hand gesture spotting and prediction simultaneously with no time delay. An HMM is used to extract features and spot meaningful gestures using a forward spotting mechanism with varying sliding window sizes, while Deep Neural Networks perform the recognition process. Therefore, a stochastic strategy for creating a non-gesture model using HMMs with no training data is suggested to accurately spot meaningful number gestures (0–9). The non-gesture model provides a confidence measure, which is utilized as an adaptive threshold to determine where meaningful gestures begin and end in the input video stream. Furthermore, DNNs are extremely efficient and perform exceptionally well when it comes to real-time object detection. According to experimental results, the proposed method can successfully spot and predict significant motions with a reliability of 94.70%.

1. Introduction

Human–computer interaction has advanced substantially, with new methods and strategies being developed on a regular basis.
The communication abilities of the deaf and gesture-based signaling systems in general have been greatly improved thanks to computer vision and artificial intelligence [1,2].
Identification of specific sign languages is used in sports [3] as well as in applications for smart homes and supported living, including human action identification [4,5,6,7], pose and posture detection [8,9], physical activity monitoring [10], and hand-gesture-based control [11]. Researchers in computer science have used a variety of mathematical models and techniques to solve problems in this field over time [12]. The progress of computer–human interaction [13] depends on the development of gesture recognition systems, both in many software applications and in the use of hand gestures across a variety of industries. Many applications, such as virtual reality [14,15], games [16], cognitive evaluation [17], and augmented reality [18], now incorporate hand gestures.
Hand gesture recognition has recently been used in human–robot interaction in manufacturing [19,20] and autonomous vehicle operation [21]. Hand gesture recognition aims to recognize and identify gestures in real time. Hand recognition is a method for determining how a hand moves by combining techniques and principles from a variety of disciplines, including image processing and neural networks [22]. In addition, hand gesture recognition offers applications such as communicating with deaf persons who are unable to use sign language.
This study compares and contrasts different algorithms to determine which is preferable in terms of accuracy and response time. This is accomplished through the use of hand gesture spotting and prediction. Using HMM-DNN model-based methods, the structure and mechanism of hand gesture recognition are investigated. The DNN model is a convolutional neural network technique for real-time object detection that is exceptionally efficient and effective [23]. An HMM can be used to extract features as well as to understand the meaning of gestures and detect target items.
The HMM-DNN model achieves a good frames per second (fps) rate compared to the base HMM model. Gesture recognition based on the HMM-DNN model does not require as much pre-processing, avoiding the need for filtering or picture enhancement. In comparison to other deep learning models, HMM-DNN is preferable because it is faster, stronger, and more dependable. The high accuracy of HMM-DNN in complex contexts allows it to detect motions even in low-resolution picture mode. The trained model is employed for real-time gesture recognition from video feeds as well as for static hand image detection. HMM-DNN models can be separated into two categories depending on whether or not hand motions are included. In the first, static hand postures remain in the same place, while in the second, gestures refer to the dynamic hand with finger movement, as shown in Figure 1. Human–human communication is a crucial step towards more natural computer communication, and serves as the foundation for creating human–computer communication. Hand gestures are among the most common means of human communication, even though they are imprecise [24]. People who are unable to use a keyboard and would prefer to be supported by a system that responds to gestures can benefit from gesture-based input. The most natural way to create a human–computer gestural interface is vision-based analysis of hand motions, using one or more cameras to capture hand movements.
Our study’s main contribution is to examine a stochastic approach without training data for employing HMMs to create a non-gesture model that can precisely identify meaningful gestures. Both an adjustable threshold and a level of confidence are provided by the non-gesture model. The main goal of using this adaptive threshold is to pinpoint the beginning and conclusion of gestures that are meaningful and produced by continuous hand motion. The relative entropy function is used to modify the non-gesture model utilizing HMMs in order to address the issue of the increasing number of states. The main goal is to speed up spotting while also preserving time and space. Additionally, to handle hand gesture segmentation and recognition simultaneously, a forward spotting method is used in conjunction with the sliding window methodology. The main objective is to produce accurate and reliable results that are ready for online applications while removing the lag between relevant gesture spotting and recognition.
The main points of this paper are as follows:
  • To accurately detect meaningful number gestures (0–9), a stochastic method for building a non-gesture model using HMMs without training data is proposed.
  • A confidence measure that the non-gesture model offers can be used as an adaptive threshold to establish the start and end points of meaningful gestures in the input video stream.
  • DNNs are extremely efficient, and perform exceptionally well when it comes to real-time object detection. According to our experimental results, the proposed method can successfully spot and predict significant motions with high reliability.
  • Our main goal is to provide accurate, robust, and online application results while also removing the lag between meaningful gesture spotting and identification.
The structure of our investigation is as follows. Related works are reviewed in Section 2. Pre-processing and feature-based tracking are described in Section 3. The Deep Neural Network used for recognition is presented in Section 4. The spotting and prediction approach based on the HMM-DNN model is detailed in Section 5. Section 6 presents the experimental results and discussion. Section 7 evaluates the computational complexity of the proposed method. Finally, Section 8 concludes the paper.

2. Related Work

Games, virtual reality, assisted living, manufacturing, and autonomous vehicle operation are all examples in which hand gestures are used. There are numerous hand gesture recognition systems that use both machine learning and deep learning techniques to identify a human hand gesture as it develops.
Deep neural networks are increasingly being used for learning in the digital world. A neural network can detect an object of interest, recognize motions, and extract characteristics. Pedro Neto et al. [25] introduced a data glove interface technique for continuous real-time gesture spotting using ANNs. To distinguish between communicative and non-communicative motions, two ANNs in sequence were proposed. The authors suggested a feed-forward design with only one hidden layer, forty-four neurons in the input layer, and ten in the output layer. The forty-four input neurons correspond to two consecutive (t and t−1) or nonconsecutive (t and t−n) signals from each sensor. The experimental results show that the suggested method has a high recognition rate, a quick learning curve, and a respectable capacity for generalization across scenarios.
Abdullah Mujahid et al. [26] provided a simple gesture recognition model based on YOLO v3 and the DarkNet-53 convolutional neural network that does not need any additional preprocessing. A labeled dataset of hand gestures in the Pascal VOC and YOLO formats was used to test the suggested model. The YOLO convolutional neural network method is excellent at real-time object detection, and in a variety of circumstances hand gesture recognition can be used to enhance control, accessibility, communication, and learning. In [27], the authors examined several convolutional neural networks, including their own custom model, in a thorough investigation. The Marcel dataset was used to assess each model's performance and show how different designs affect it; the GoogLeNet method, which makes use of the Inception architecture, produced the best results, followed by their custom model. Xin Gao et al. [28] constructed deep learning models to recognize targets when the target is a sequence pattern. The identification of pertinent characters in a text sequence is made easier with accurate sequence pattern prediction. Despite significant advancements in the application of machine learning to sequence pattern recognition problems, effectiveness remains limited, as extracting features from raw sequences requires a large amount of manual feature engineering.
This sequence pattern recognition issue can be addressed using deep learning techniques. Each dataset has a consistent pattern, and the sequences are original genomic format sequences. Additionally, a variety of deep learning models have been looked at (including convolutional, recurrent, and combined networks). Sequences are encoded using the one-hot encoding approach, which protects the crucial positional information of each character.

3. Pre-Processing and Feature-Based Tracking

In the process of image acquisition, two types of images are obtained: 2D image sequences and depth image sequences. Depth data are acquired using passive stereo measurement, and rely on mean absolute difference and data on the cameras’ calibration. In our application, the depth value ranges from a minimum of 30 cm to a maximum of 200 cm. The depth range, on the other hand, is adapted to the region of interest. The depth values in the current frame that correspond to the region of interest are averaged. As a result, for each consecutive frame the depth is rearranged with respect to the region of interest, as shown in Figure 2. In addition, the accuracy of skin segmentation is improved by depth information, which neutralizes complex backgrounds.
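The following is a minimal Python sketch of this depth-assisted segmentation step. The 30–200 cm depth window comes from the text above, while the YCbCr thresholds and the OpenCV-based implementation are illustrative assumptions rather than the paper's exact procedure (which combines a trained Gaussian Mixture skin model with the depth map, as noted in Section 6).

```python
import cv2
import numpy as np

def skin_mask(bgr_frame, depth_map, d_min=0.30, d_max=2.00):
    """Depth-assisted skin segmentation sketch: keep only pixels whose depth lies
    in the 30-200 cm working range, then threshold the chrominance channels.
    The Cb/Cr bounds are common literature values, not taken from this paper."""
    ycrcb = cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2YCrCb)
    _, cr, cb = cv2.split(ycrcb)
    skin = (cr > 133) & (cr < 173) & (cb > 77) & (cb < 127)   # assumed skin bounds
    in_range = (depth_map >= d_min) & (depth_map <= d_max)    # neutralize the background
    return (skin & in_range).astype(np.uint8) * 255           # binary mask (0 or 255)
```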
To track the hand and obtain its motion (i.e., a hand gesture), we employ a mean-shift procedure in conjunction with depth information to extract the set of hand postures. The hand postures are then connected to build a gesture path. The mean-shift procedure depends on a similarity function, the so-called Bhattacharyya coefficient, to acquire the candidate hand most similar to the target hand. By connecting the centroid locations of the hand area to determine the hand gesture path, mean-shift analysis aims to achieve accurate hand tracking. The obtained gesture trajectory points are then smoothed in order to effectively account for unanticipated shifts in hand position [30]. The motion trajectory of the hand is called the gesture path, consisting of spatio-temporal patterns made up of the centroid points of the hand regions (x_hand, y_hand).

The choice of appropriate features for recognizing the hand gesture path has a substantial impact on system performance. Location, orientation, and velocity are the three basic characteristics. Two sorts of location features are evaluated for this purpose. The first location feature, denoted by the symbol L_c, measures the distance from the gesture path's centroid point to every other point; this is motivated by the fact that several location characteristics would otherwise be created for the same gesture based on distinct starting positions. The second location feature, L_sc, is calculated from the gesture path's start point to the current position. Another important feature is the orientation, which plays a large role in the identification of hand gestures. In our research, we rely on three orientations: θ_1(t), between each point and the gesture centroid; θ_2(t), the orientation of two succeeding points; and θ_3(t), which captures the orientation of the hand displacement vector at each point. The final fundamental attribute is velocity, which is crucial throughout the gesture recognition phase, especially in certain key instances. The velocity is determined by how quickly each hand gesture is produced; here it is computed as the Euclidean distance between two successive points divided by the elapsed time t. As a result, the gesture path is expressed as a sequence of feature vectors, which are clustered and projected in the feature space to produce discrete codewords. This is carried out using the k-means clustering approach, which divides the hand gesture into "K" clusters in the feature space to produce discrete symbols that can be fed into the classifier [30]. The reason for employing the k-means algorithm stems from its ease of representation, scalability, speed of convergence, and adaptability to sparse data. A sketch of this feature extraction and quantization step is given below.
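As a concrete illustration, the sketch below computes the per-frame feature vector (L_c, L_sc, θ_1, θ_2, θ_3, V) from a sequence of hand centroids and quantizes it with k-means. The codebook size k and the exact definition of θ_3 are assumptions, since the paper does not state them explicitly.

```python
import numpy as np
from sklearn.cluster import KMeans

def gesture_features(path):
    """Per-frame features of a gesture path of hand centroids with shape (T, 2)."""
    path = np.asarray(path, dtype=float)
    centroid, start = path.mean(axis=0), path[0]
    disp = np.diff(path, axis=0, prepend=path[:1])        # frame-to-frame displacement
    d_c, d_s = path - centroid, path - start
    L_c = np.linalg.norm(d_c, axis=1)                     # distance to the path centroid
    L_sc = np.linalg.norm(d_s, axis=1)                    # distance to the start point
    theta1 = np.arctan2(d_c[:, 1], d_c[:, 0])             # orientation w.r.t. the centroid
    theta2 = np.arctan2(disp[:, 1], disp[:, 0])           # orientation of succeeding points
    theta3 = np.arctan2(d_s[:, 1], d_s[:, 0])             # assumed: displacement w.r.t. start
    V = np.linalg.norm(disp, axis=1)                      # velocity (distance per frame)
    return np.column_stack([L_c, L_sc, theta1, theta2, theta3, V])

def quantize(features, k=32):
    """Cluster the feature vectors into k codewords (discrete symbols for the HMMs)."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)
    return km.labels_, km
```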
Figure 3 shows the cluster trajectories of the hand gesture paths for '3' and '5', which are predicted in accordance with the combined (L_c, L_sc, θ_1, θ_2, θ_3, V) features. From frame 21 to frame 43, the cluster trajectories for the gesture paths '3' and '5' exhibit substantially identical cluster indices. As a result, the effectiveness of the combined features (L_c, L_sc, θ_1, θ_2, θ_3, V) is established.

4. Deep Neural Network

Traditional sign language translation technologies use Hidden Markov Models and classifiers such as Support Vector Machines (SVM) and kNN to categorize hand motion images. These techniques, however, require carefully engineered classification features. In order to automate the processes of feature extraction and feature selection, this paper suggests a DNN-based gesture recognition approach in which the motion trajectory of the hand is recognized using a Deep Learning (DL) technique. For this, a custom DNN is constructed that includes three convolutional layers in addition to three max-pooling layers; this three-layer Deep Convolutional Neural Network is used in the proposed method to recognize hand gestures.
In Figure 4, the proposed DNN for the classification of number gestures from 0 to 9 is displayed, along with the dimension information for each layer. The input layer's dimension is (3, 128, 128), where the first value indicates the RGB channels and the remaining values indicate the input image dimensions. Each of the 16 filters in the first ConvNet block has a size of 5, and a max-pooling layer of size 2 comes next, before the final two ConvNet layers. To equalize the weights among the convolutional layers, a batch normalization layer is employed. After the ConvNet blocks, a dropout of 0.4 is applied and the feature maps are flattened. Utilizing three dense layer blocks (i.e., an Artificial Neural Network), the number of output neurons is decreased to the final dimension of ten neurons. Here, the ten neurons reflect the ten different gestures that can be detected. Finally, the softmax classification layer is used to predict hand motions at the output layer. A sketch of this architecture is given below.
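A minimal PyTorch sketch of this architecture follows. The input size, the 16 filters of size 5 in the first block, the 2×2 pooling, the 0.4 dropout, and the ten-way softmax follow the description above; the filter counts of the second and third blocks, the dense-layer widths, and the placement of batch normalization after every convolution are assumptions, as they are not stated in the text.

```python
import torch
import torch.nn as nn

class GestureDNN(nn.Module):
    """Sketch of the three-block ConvNet described above."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, padding=2),   # block 1: 16 filters of size 5
            nn.BatchNorm2d(16), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=5, padding=2),  # block 2 (assumed 32 filters)
            nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=5, padding=2),  # block 3 (assumed 64 filters)
            nn.BatchNorm2d(64), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(0.4),                              # the 0.4 dropout after the ConvNet blocks
            nn.Linear(64 * 16 * 16, 256), nn.ReLU(),      # three dense blocks down to 10 outputs
            nn.Linear(256, 64), nn.ReLU(),
            nn.Linear(64, num_classes),
        )

    def forward(self, x):                                 # x: (B, 3, 128, 128)
        return self.classifier(self.features(x))

# Softmax is applied at inference to obtain class probabilities
# (it is folded into the cross-entropy loss during training).
model = GestureDNN()
probs = torch.softmax(model(torch.randn(1, 3, 128, 128)), dim=1)   # shape (1, 10)
```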
Additionally, the ConvNet weights are saved at the learning phase's highest accuracy level. Each learning epoch is divided into four batches, and training runs for twenty epochs. The loss function used in this work is categorical cross-entropy; it drives the updates of the DNN's weight vector in order to minimize the learning error. During the learning process, the Stochastic Gradient Descent (SGD) weight optimizer is used to hasten the convergence of the DNN model to the ideal neural weights. In the proposed model, the deep learning task has a learning rate of 0.01 and a momentum of 0.5. The total set of learning images for the trained DNN model is split into two categories: 80% training and 20% validation. The weights are updated after each of the ten training trials, and the trial producing the most accurate results is kept. The first convolution layer uses a 'ReLU' activation and is followed by a max-pooling layer; a (2, 2) mask is used to downsample the input to the intermediate levels of the DNN. Finally, the input image is assigned to the most likely gesture number using the classification layer and the softmax function.
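The training configuration described above (SGD with a learning rate of 0.01 and momentum of 0.5, categorical cross-entropy, an 80/20 split, twenty epochs, and checkpointing at the best validation accuracy) can be sketched as follows, reusing the GestureDNN class from the previous sketch. The random tensors merely stand in for the real gesture image dataset, and the batch size of four is one reading of the text.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

model = GestureDNN()                                   # class from the previous sketch
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.5)
criterion = torch.nn.CrossEntropyLoss()                # categorical cross-entropy

# Placeholder dataset: 100 RGB images of size 128x128 with labels 0-9.
data = TensorDataset(torch.randn(100, 3, 128, 128), torch.randint(0, 10, (100,)))
train_set, val_set = random_split(data, [80, 20])      # 80% training / 20% validation
train_loader = DataLoader(train_set, batch_size=4, shuffle=True)   # assumed batch size
val_loader = DataLoader(val_set, batch_size=4)

best_acc = 0.0
for epoch in range(20):                                # twenty learning epochs
    model.train()
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    # Keep the weights from the epoch with the highest validation accuracy.
    model.eval()
    correct = 0
    with torch.no_grad():
        for images, labels in val_loader:
            correct += (model(images).argmax(1) == labels).sum().item()
    acc = correct / len(val_set)
    if acc > best_acc:
        best_acc = acc
        torch.save(model.state_dict(), "best_gesture_dnn.pt")
```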

5. Spotting and Prediction Approach

The primary contribution of this paper is the presentation of a forward gesture spotting technique that is capable of performing gesture spotting and recognition. The time lag between the spotting and recognition processes is also eliminated by this method. The use of the HMMs and DNN for hand gesture detection and recognition is described in the following sections.

5.1. Spotting with HMMs

HMMs provide a well-suited process for modeling the spatiotemporal sequences of gestures, and can accommodate non-gesture patterns without any training data (a garbage model or filler model). The non-gesture technique is used for precisely identifying important gestures. The starting and ending points of meaningful hand gestures present in the input video sequences are identified using the non-gesture model as a confidence measure (i.e., an adaptive threshold). In the next subsections, the process of constructing a non-gesture model with respect to the gesture references is described (see Figure 5).

5.2. Gesture Model

Each hand gesture for the numbers "0" to "9" was constructed using the HMM parameters λ = (π, A, B). In every reference gesture, the HMM state represents the local segmented component, while the transition between states is used for the gesture path's sequential order structure. It should be noted that the number of HMM states is an important factor, and that using more states can result in overfitting if there are not enough samples of training data. In fact, more than one segmented meaningful part of a graphical pattern may be present in one state. If an inadequate number of states is utilized, the discriminating power of the HMMs is diminished. In the gesture spotting technique, each straight-line segment (i.e., key gesture) is assigned to a single state of the HMM depending on each individual hand gesture's complexity (Figure 6).
Left–right banded (LRB) topology is of great importance when modeling each reference gesture. In an ergodic model there is more than one outgoing transition per state, unlike the LRB topology; hence, the structural data can easily be lost. In addition, there is no backward transition in LRB, meaning that the state index either stays the same or increases over time. Furthermore, the LRB topology has fewer free parameters than the ergodic topology, and it is easier to carry out the training and testing processes. As a result, the Baum–Welch procedure is critical in our approach, as it is employed to carry out the complete training process starting from the initialized HMM parameters λ = (π, A, B). Interested readers may wish to review [31] for additional information. A minimal sketch of the LRB initialization is given below.
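In the sketch below, each state keeps only a self-transition and a transition to its successor, and Baum–Welch (available in common HMM toolkits) would then re-estimate λ = (π, A, B) from the training gesture paths. The number of discrete symbols is an assumed codebook size.

```python
import numpy as np

def init_lrb_hmm(n_states, n_symbols):
    """Initialize a left-right banded (LRB) HMM: self-transition plus one forward
    transition per state, no backward jumps; parameters are later refined with Baum-Welch."""
    A = np.zeros((n_states, n_states))
    for i in range(n_states - 1):
        A[i, i] = A[i, i + 1] = 0.5               # self-loop and step to the next state
    A[-1, -1] = 1.0                               # last state absorbs
    pi = np.zeros(n_states)
    pi[0] = 1.0                                   # every gesture starts in the first state
    B = np.full((n_states, n_symbols), 1.0 / n_symbols)   # uniform discrete emissions
    return pi, A, B

# e.g., gesture '4' is modeled with five states (one per straight-line segment)
pi, A, B = init_lrb_hmm(n_states=5, n_symbols=32)
```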

5.3. Non-Gesture Model

Collecting non-gesture patterns is difficult due to the infinite variety of meaningless motion. As a result, we build a single hidden Markov model, known as the non-gesture model (a so-called garbage model [31]), which is used to represent all non-reference patterns. The non-gesture model, unlike any gesture model, matches any motion trajectory or portion thereof. Even if a pattern's probability value is the highest among all the reference gestures, we cannot be sure that the pattern is truly close to that reference gesture model; the HMM recognizer simply chooses the model with the highest probability. Here, the non-gesture model provides strong evidence for rejecting non-gesture graphical patterns. It should be noted that the self-transition of each state in the HMM reflects a line segment within a meaningful pattern. Additionally, the HMM's internal segmentation property together with the outward transitions among states provides the rest of the gesture's sequential segmented patterns. An ergodic model is created using this characteristic, with the states copied from every reference hand gesture and fully connected (Figure 7).
To simplify the structure, two dummy states are added; these produce no observations and therefore introduce no time delay. By copying all the states of every gesture model in the proposed system, the non-gesture model is constructed in the following way:
  • First, we copy all states of each hand gesture model along with their output observation probabilities b_j(m). Then, using a Gaussian smoothing filter, we re-estimate these probabilities so that the copied states can represent any pattern, and a smoothing floor is applied:
    Non-gesture(b_j(m)) = (1 / (√(2π)·σ)) · exp(−(b_j(m))² / (2σ²))    (1)
  • We replicate the probability of self-transition states in the gesture models, as every state reflects a meaningful unit (i.e., segmented graphical pattern) of the hand gesture. Therefore, the quantity of those components determines the target gestures.
  • The following formula is used to calculate all outbound transition probabilities:
    â_ij = (1 − a_ii) / (N − 1),    for all j, i ≠ j    (2)
Here, â_ij represents the transition probability of the non-gesture model from state s_i to state s_j, a_ii expresses the self-transition probability copied from the gesture models, and N represents the total number of states over all gesture models. The resulting "non-gesture model" is a straightforward model that matches every possible pattern of every trained gesture model. Because of its low forward transition probabilities, the likelihood of the non-gesture model for a particular gesture is lower than that of the dedicated gesture models (Figure 8). The non-gesture model therefore provides a confidence metric that can be used to distinguish significant gestures. This measure is based on a differential probability value, determined by comparing the maximal observation probability over the gesture models to that of the non-gesture model for a particular input pattern. This confidence measure is regarded as an adaptive threshold for gesture spotting or for selecting the best gesture model. A construction sketch of the non-gesture model is given below.
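The construction can be sketched as follows. Each trained gesture model is assumed to be a dict with a transition matrix "A" and a discrete emission matrix "B"; the smoothing width σ is an assumed value, and the two dummy states and the flooring step are omitted for brevity.

```python
import numpy as np

def build_non_gesture_model(gesture_models, sigma=0.5):
    """Non-gesture (garbage) model sketch: copy every state of every gesture model,
    smooth its emission probabilities with a Gaussian (Equation (1)), keep the
    self-transitions, and spread the remaining mass equally over the other states
    (Equation (2)), yielding a fully connected (ergodic) model."""
    a_self = np.concatenate([np.diag(m["A"]) for m in gesture_models])
    B = np.vstack([m["B"] for m in gesture_models])

    # Gaussian smoothing of the emissions, followed by re-normalization per state.
    B = np.exp(-(B ** 2) / (2.0 * sigma ** 2)) / (np.sqrt(2.0 * np.pi) * sigma)
    B /= B.sum(axis=1, keepdims=True)

    # Outbound transitions: a_hat[i, j] = (1 - a_ii) / (N - 1) for all j != i.
    N = len(a_self)
    A = np.tile(((1.0 - a_self) / (N - 1))[:, None], (1, N))
    np.fill_diagonal(A, a_self)
    return A, B
```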
The differentiation of input patterns becomes computationally expensive as the number of states in a non-gesture model increases. The most obvious advantage of using relative entropy is the decrease in the number of states in the non-gesture model [31]. The computation proceeds more quickly as a result, and less time and space are needed.

5.4. Gesture Spotting Network

To spot meaningful (i.e., key) gestures, the network for spotting gestures is built as shown in Figure 9.
There are ten different models in this network, one for each of the ten number gestures from 'gesture0' to 'gesture9'. The LRB model, which has three to five states depending on the complexity of the gesture, is used to generate these ten models. The network is rebuilt after reducing the states using the relative entropy measure. In addition, it includes a dummy start state, denoted by the symbol S. The network for hand gesture spotting finds the beginning and ending points of meaningful gestures present in the input video stream, segmenting and identifying the gestures as it does so.

5.5. Spotting and Recognition

To spot gestures correctly with no time delay, we employ the forward spotting method, in which a differential probability value (denoted DP) is equal to the difference between the observation probability of the maximal gesture model and that of the non-gesture model (Figure 10). The maximal gesture model is the one with the highest probability value P(O | λ_g) among all gestures, where g is the index of the gesture models, ranging from 0 to 9.
It should be noted that when the DP value changes from negative to positive, a transition from non-gesture to gesture occurs, so that O can be spotted as gesture g. Similarly, O cannot be a gesture when the value of DP moves from positive to negative; in that case, the shift occurs from gesture to non-gesture (Equations (3) and (4)). As a result, the following conditions are used as a rule to determine where gestures begin and end, with the DP value acting as an adjustable threshold for spotting key hand gestures:

∀ g: P(O | λ_g) < P(O | λ_non-gesture)    (3)

∃ g: P(O | λ_g) > P(O | λ_non-gesture)    (4)
The suggested hand gesture spotting method is made up of two primary components, the segmentation module (known as the spotting module) and the recognition module. A sliding window approach is employed in the gesture segmentation module. This technique computes the observation probabilities of the non-gesture model and of the ten gesture models for the observed segmented patterns, and combines them into the DP value. Instead of a single observation, the sliding window (S_w) contains several sequential observations (see Figure 11).
The sliding window is used to mitigate the effects of short-term observation changes caused by insufficient feature extraction. The best sliding window size is determined empirically. Several experiments on the proposed system were carried out with various sliding window sizes ranging from 1 to 8 in order to find the best outcome, and a value of 5 produced the best results. After detecting the start point of a key gesture in the continuous image sequences, the gesture recognition module is turned on and performs the recognition procedure incrementally on the segmented pattern until the key gesture's end signal is received. Throughout this process, the DNN model is activated to recognize the key gestures. These steps are repeated until no more gesture images are being entered. Figure 11 depicts how the sliding window operates and how the visible sequences are recognized accumulatively.
To show how the technique works, let us assume that the sliding window size is denoted by S_w and that the input observation sequence O is given as {o_1, o_2, ..., o_t, ..., o_T} with length T. Here, we initialize the window with the observation sequence O_{t=0} = {o_1, o_2, ..., o_{S_w}} to calculate the DP value (Equation (5)):

DP(t) = max_g P(O_t | λ_g) − P(O_t | λ_non-gesture)    (5)
If the DP(t) value is less than zero (i.e., negative), the starting point of a key gesture has not been spotted, and the sliding window is shifted by one unit to become O_{t+1} = {o_{t+1}, o_{t+2}, ..., o_{S_w+t}}. This step is iterated until the value of DP becomes positive, in which case the DNN is activated to carry out classification.
The segmented observations of a key gesture are connected via the union of all partial gesture segments, O = O_1 ∪ O_2 ∪ ... Furthermore, the type of gesture O is identified using the DNN at each step, and the final label g for an observed gesture segment O is the one selected when no further gesture images belonging to it remain. If more gesture images are found, the previous steps are repeated and S_w is re-initialized at the next time t. As a result, the forward technique eliminates the temporal delay between gesture detection and recognition. A sketch of this forward spotting procedure is given after this paragraph.
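The forward spotting loop with the sliding window can be sketched as below. Here `loglik` and `recognize` are assumed callables (the log-likelihood of a segment under an HMM, and the DNN classifier, respectively); since the logarithm is monotonic, the sign of the log-likelihood difference matches the sign of the DP value in Equation (5).

```python
def forward_spotting(observations, gesture_hmms, non_gesture_hmm,
                     loglik, recognize, window=5):
    """Forward spotting sketch: slide a window over the observation stream, compute
    DP(t) = max_g log P(O_t|g) - log P(O_t|non-gesture), open a gesture when DP turns
    positive, classify it incrementally with the DNN, and close it when DP turns negative."""
    results = []
    start, label = None, None
    for t in range(len(observations) - window + 1):
        O_t = observations[t:t + window]
        dp = max(loglik(m, O_t) for m in gesture_hmms) - loglik(non_gesture_hmm, O_t)
        if dp > 0:
            if start is None:
                start = t                                        # DP turns positive: start point
            label = recognize(observations[start:t + window])    # incremental recognition
        elif start is not None:
            results.append((start, t + window - 1, label))       # DP turns negative: end point
            start, label = None, None
    if start is not None:                                        # stream ended inside a gesture
        results.append((start, len(observations) - 1, label))
    return results
```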

6. Experimental Results and Discussion

The hand region was segmented against complicated backgrounds utilizing a depth map for hand and face detection and YCbCr color space information for skin segmentation. For skin modeling, Gaussian Mixture Models were investigated, with a huge dataset of skin and non-skin pixels used for training. In addition, to track the hand and construct a gesture path, morphological procedures and the mean-shift algorithm were used. Hand gesture path features were extracted based on two separate location features, three different orientations, and velocity. k-means clustering was used to quantize the gathered features into discrete symbols, which serve as input for the spotting network in order to identify the beginning and ending points of meaningful gestures. The input images were taken with a Bumblebee stereo camera system using the Matlab and C++ programming languages, at 15 frames per second, a 6 mm focal length, and a 240 × 320 pixel resolution.
Our database, which contains 600 video clips of isolated gestures collected from three people on a set of numbers, was used to classify the results. Every number gesture from ‘0’ to ‘9’ was created using 60 training videos. Additionally, the database included 280 video clips depicting continuous hand movements for testing. One or more major gestures appear in each video sample.
To determine which model a test gesture belongs to, the gesture recognition module compares it to a dataset of reference hand gestures. Furthermore, the non-gesture model has 40 states before state reduction and 22 states after state reduction. This provides a number of advantages that allow the system to proceed in real time, thereby saving time and space. The percentage of correctly recognized (true) hand gestures over the total number of tested hand gestures is the recognition ratio (Rec.), which is used to assess the proposed system (Equation (6)).
Rec. = (# recognized hand gestures / # test hand gestures) × 100    (6)
There are three kinds of errors in the automatic gesture spotting task (Table 1): Deletion (D), Substitution (S), and Insertion (I). An insertion error occurs when the spotter reports a gesture that does not actually exist. A substitution error occurs when a key gesture is classified incorrectly, i.e., the hand gesture is recognized as a different gesture; this issue frequently arises when the extracted features are incorrectly quantized to another codeword. A deletion error occurs when the spotter misses a key gesture. Insertion errors are not taken into account when calculating the recognition ratio (Equation (6)). Substitution and deletion errors, on the other hand, are likely to create insertion errors, as the spotted points are used as a strong control in deciding the gesture end points, and an error there can remove all or part of a significant gesture from the observation. Insertion errors thus have no effect on the recognition ratio, while deletion errors do; however, insertion errors have a direct impact on the gesture spotting ratio. The following equation defines another performance measure, termed reliability (Rel.), to account for the effect of insertion errors:
Rel. = (# correctly recognized hand gestures / (# test gestures + # insertion errors)) × 100    (7)
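As a quick check of Equations (6) and (7), the snippet below reproduces the S_w = 5 row of Table 2 (268 correctly recognized gestures out of 280 tests with 3 insertion errors).

```python
def recognition_ratio(correct, tested):
    """Rec. = (# correctly recognized gestures / # tested gestures) x 100."""
    return 100.0 * correct / tested

def reliability(correct, tested, insertions):
    """Rel. additionally counts insertion errors in the denominator."""
    return 100.0 * correct / (tested + insertions)

print(round(recognition_ratio(268, 280), 2))   # 95.71
print(round(reliability(268, 280, 3), 2))      # 94.7
```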
The number of spotting errors is used to calculate the recognition ratio and reliability (Table 2). We tested the accuracy of gesture spotting using the sliding window technique with sizes ranging from 1 to 8 (Figure 12a). It should be noted that as the size of the sliding window increases, the accuracy of gesture spotting first improves and then begins to deteriorate. We empirically determined the best sliding window size to be 5, which provided 94.70% reliability for the automatic gesture spotting method. The number of errors drops dramatically as S_w grows from 1 towards 5, as shown in Figure 12b; however, once S_w exceeds 5, the numbers of deletion, insertion, and substitution errors increase again. As S_w grows larger, the window comprises both gesture and non-gesture observation information, resulting in the loss of meaningful gestures' starting and ending locations.
The identification rate of key gestures with window sizes ranging from 1 to 8 is shown in Table 2. In Figure 13, the visual sequence includes three key gestures: '3', '2', and '6'. We only consider the temporal evolution of the likelihoods of gestures '2', '3', and '6' and of the non-gesture label; the other curves are omitted for simplicity because their probabilities are low. At frame index 37, the gesture '3' comes to an end. Between frames 37 and 50 the non-gesture label has the highest probability, so no start point for the next key gesture is detected there. A new key gesture starts at frame index 51, where the likelihood of the non-gesture label falls below those of the gesture labels. At frame index 75, the gesture '2' comes to an end, and between frame index 75 and frame index 91 the non-gesture label again dominates. At frame 92, the gesture '6' begins, ending at frame index 121. Furthermore, when applied to multiple video examples featuring challenging conditions such as occlusion between the hands and face, the suggested system automatically distinguishes key hand motions using HMM+DNN with good performance and low computational complexity, achieving an accuracy rate of 94.70%.
It should be highlighted that the proposed method has a high recognition rate for spotting gestures, which is attributable to a good selection of feature candidates for optimally discriminating between input patterns. In addition, the training phase requires thorough, experimentally based selection of the initialization values. Furthermore, HMMs + DNN are capable of effectively alleviating spatiotemporal variabilities. As a result, this system can be used in real-time applications and eliminates the time gap between the spotting and recognition tasks.
Backward spotting techniques work by first detecting a gesture's end point and then tracing back over the best paths to find the gesture's starting location. The trajectory in between is submitted to the classifier for recognition after the start and end locations have been detected. Because of this, there is an unacceptable latency between spotting and recognizing meaningful gestures for online applications. The primary benefit of the suggested approach is the provision of a forward gesture spotting technique for concurrent segmentation and recognition of hand gestures. For non-gesture graphical patterns, a stochastic method for creating a non-gesture model from HMMs without a learning dataset is also proposed.
Figure 14 illustrates the average time required to perform the backward and forward spotting of key gestures from ‘0’ to ‘9’ at S w = 5. Due to the retracing process used to find the key gesture’s start point, reverse spotting is seen to take longer than forward spotting. Determining the gap between identifying the start point and the end points of a key gesture is therefore crucial for system evaluation. As a result, the suggested system can carry out simultaneous tasks for gesture detection and recognition with regard to the gestures for the numbers 0–9. Because this approach unifies the spotting and identification tasks, it is especially useful for real-time implementation.
We constructed a comparison between our suggested strategy and other approaches using comparable experimental setups and datasets in order to achieve a fair comparison. To evaluate the efficacy of our approach, the outcomes were compared to those in [32,33] (Table 3). The conditional random field (CRF) approach was used in [32] to carry out forward spotting and identification for ten gestures. Considering that the necessary modeling time varies depending on the observation window, the training process for the CRF in this case is more expensive. The findings demonstrated that the work of [32] was successful in spotting and identifying meaningful gestures embedded in the input video stream, with a 90.4% identification rate.
In [33], a hidden Markov model (HMM) classifier was used to perform spotting based on a forward technique in conjunction with a non-gesture model built from the ten reference gestures. This method achieved promising results, as shown in Table 3. A drawback of HMMs is the abundance of unstructured parameters, and the first-order Markov property makes it difficult to capture relationships between hidden states. The results obtained here compare favorably with our earlier research.

7. Evaluation

According to earlier studies, about one and a half observations were needed to create the non-gesture model with HMMs. As a result, the temporal complexity C can be computed and used to analyze the gesture spotting technique as follows:
C = L·a·N̄·T + N_ng²·T    (8)
Here, the number of gesture models is denoted by L (i.e., ten models), and the number of transitions for each state is denoted by a (in this case, two transitions per state because of the LRB HMM topology). In addition, N̄ represents the average number of states over all gestures (in our case, N̄ = 4), and T is the duration of the observation feature sequence.
Furthermore, the number of states used to build the spotting network's non-gesture model is nearly equal to 40. Using relative entropy, the number of states for this network can be reduced from 40 to 22 without any negative effect on its functioning (i.e., N_ng = 40 before state reduction and N′_ng = 22 after state reduction).
For circumstances in which more states result in a loss of time and space, relative entropy is a useful technique for reducing the number of states. As a result, the estimated rate of reduction in evaluation time (E) for gesture spotting is as follows:
E = [(L·a·N̄·T + N_ng²·T) − (L·a·N̄·T + N′_ng²·T)] / (L·a·N̄·T + N_ng²·T)    (9)

where N′_ng denotes the non-gesture model's reduced number of states. Thus, Equation (9) can be simplified to

E = (N_ng² − N′_ng²) / (L·a·N̄ + N_ng²) = (40² − 22²) / ((10)·(2)·(4) + 40²) ≈ 0.66    (10)
As a result, the expected saving in evaluation time according to Equation (10) is 66.42 percent.

8. Conclusions

This paper explores an intelligent method for spotting and recognizing hand gestures representing the numbers 0–9 using Hidden Markov Models and Deep Neural Networks. With no training dataset, a stochastic approach for creating a non-gesture model using HMMs is proposed to accurately spot meaningful gestures. The non-gesture model provides a confidence measure that is utilized as an adaptive threshold to determine where meaningful gestures begin and end in the input video stream. Furthermore, DNNs are extremely efficient and perform exceptionally well on real-time object detection tasks. The proposed method can successfully spot and predict significant motions with a reliability of 94.70%, and has the ability to perform simultaneous gesture spotting and recognition with respect to gestures representing the numbers 0–9. This approach is particularly effective for real-time implementations, as it bridges the gap between the spotting and recognition tasks. While this work represents a promising first step, its remaining limitations must be addressed progressively by future work seeking to improve on the achieved results. This can be accomplished by enhancing the existing interaction technologies and developing fresh approaches for the automatic creation of non-gesture models, particularly ones that use random gestures as non-communicative gestures. Furthermore, we plan to create a more sophisticated convolutional neural network incorporating data fusion, motivated by recent efforts to improve the accuracy of hand gesture recognition.

Author Contributions

Conceptualization, M.M.A. and H.M.I.; methodology, M.E. and R.E.-A.; software, E.A.; validation, M.E., E.A. and R.E.-A.; formal analysis, M.E.; investigation, M.M.A. and M.E.; resources, H.M.I.; data curation, E.A.; writing—original draft preparation, M.M.A.; writing—review and editing, H.M.I.; visualization, M.E.; supervision, M.E. and R.E.-A.; project administration, E.A.; funding acquisition, M.E. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Zhao, Y.; Wang, W.; Wang, Y. A real-time hand gesture recognition method. In Proceedings of the 2011 International Conference on Electronics, Communications and Control (ICECC), Ningbo, China, 9–11 September 2011. [Google Scholar] [CrossRef]
  2. Oudah, M.; Al-Naji, A.; Chahl, J. Hand Gesture Recognition Based on Computer Vision: A Review of Techniques. J. Imaging 2020, 6, 73. [Google Scholar] [CrossRef] [PubMed]
  3. Žemgulys, J.; Raudonis, V.; Maskeliūnas, R.; Damaševičius, R. Recognition of basketball referee signals from real-time videos. J. Ambient Intell. Humaniz. Comput. 2019, 11, 979–991. [Google Scholar] [CrossRef]
  4. Al-Hammadi, M.; Muhammad, G.; Abdul, W.; Alsulaiman, M.; Bencherif, M.A.; Alrayes, T.S.; Mathkour, H.; Mekhtiche, M.A. Deep Learning-Based Approach for Sign Language Gesture Recognition With Efficient Hand Gesture Representation. IEEE Access 2020, 8, 192527–192542. [Google Scholar] [CrossRef]
  5. Vaitkevičius, A.; Taroza, M.; Blažauskas, T.; Damaševičius, R.; Maskeliūnas, R.; Woźniak, M. Recognition of American Sign Language Gestures in a Virtual Reality Using Leap Motion. Appl. Sci. 2019, 9, 445. [Google Scholar] [CrossRef] [Green Version]
  6. Rezende, T.M.; Almeida, S.G.M.; Guimarães, F.G. Development and validation of a Brazilian sign language database for human gesture recognition. Neural Comput. Appl. 2021, 33, 10449–10467. [Google Scholar] [CrossRef]
  7. Afza, F.; Khan, M.A.; Sharif, M.; Kadry, S.; Manogaran, G.; Saba, T.; Ashraf, I.; Damaševičius, R. A framework of human action recognition using length control features fusion and weighted entropy-variances based feature selection. Image Vis. Comput. 2021, 106, 104090. [Google Scholar] [CrossRef]
  8. Nikolaidis, A.; Pitas, I. Facial feature extraction and pose determination. Pattern Recognit. 2000, 33, 1783–1791. [Google Scholar] [CrossRef]
  9. Kulikajevas, A.; Maskeliunas, R.; Damaševičius, R. Detection of sitting posture using hierarchical image composition and deep learning. PeerJ Comput. Sci. 2021, 7, e442. [Google Scholar] [CrossRef]
  10. Ryselis, K.; Petkus, T.; Blažauskas, T.; Maskeliūnas, R.; Damaševičius, R. Multiple Kinect based system to monitor and analyze key performance indicators of physical training. Hum.-Centric Comput. Inf. Sci. 2020, 10, 51. [Google Scholar] [CrossRef]
  11. An ANN-based gesture recognition algorithm for smart-home applications. KSII Trans. Internet Inf. Syst. 2020, 14, 1967–1983. [CrossRef]
  12. Abraham, L.; Urru, A.; Normani, N.; Wilk, M.; Walsh, M.; O’Flynn, B. Hand Tracking and Gesture Recognition Using Lensless Smart Sensors. Sensors 2018, 18, 2834. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  13. Ahmed, S.; Cho, S.H. Hand Gesture Recognition Using an IR-UWB Radar with an Inception Module-Based Classifier. Sensors 2020, 20, 564. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  14. Alkemade, R.; Verbeek, F.J.; Lukosch, S.G. On the Efficiency of a VR Hand Gesture-Based Interface for 3D Object Manipulations in Conceptual Design. Int. J. Hum.–Comput. Interact. 2017, 33, 882–901. [Google Scholar] [CrossRef]
  15. Lee, Y.S.; Sohn, B.S. Immersive Gesture Interfaces for Navigation of 3D Maps in HMD-Based Mobile Virtual Environments. Mob. Inf. Syst. 2018, 2018, 2585797. [Google Scholar] [CrossRef] [Green Version]
  16. Lee, D.H.; Hong, K.S. Game interface using hand gesture recognition. In Proceedings of the 5th International Conference on Computer Sciences and Convergence Information Technology, Seoul, Republic of Korea, 30 November–2 December 2010. [Google Scholar] [CrossRef]
  17. Negin, F.; Rodriguez, P.; Koperski, M.; Kerboua, A.; Gonzàlez, J.; Bourgeois, J.; Chapoulie, E.; Robert, P.; Bremond, F. PRAXIS: Towards automatic cognitive assessment using gesture recognition. Expert Syst. Appl. 2018, 106, 21–35. [Google Scholar] [CrossRef] [Green Version]
  18. Del Rio Guerra, M.S.; Martin-Gutierrez, J.; Acevedo, R.; Salinas, S. Hand Gestures in Virtual and Augmented 3D Environments for Down Syndrome Users. Appl. Sci. 2019, 9, 2641. [Google Scholar] [CrossRef] [Green Version]
  19. Kaczmarek, W.; Panasiuk, J.; Borys, S.; Banach, P. Industrial Robot Control by Means of Gestures and Voice Commands in Off-Line and On-Line Mode. Sensors 2020, 20, 6358. [Google Scholar] [CrossRef]
  20. Neto, P.; Simão, M.; Mendes, N.; Safeea, M. Gesture-based human-robot interaction for human assistance in manufacturing. Int. J. Adv. Manuf. Technol. 2018, 101, 119–135. [Google Scholar] [CrossRef]
  21. Young, G.; Milne, H.; Griffiths, D.; Padfield, E.; Blenkinsopp, R.; Georgiou, O. Designing Mid-Air Haptic Gesture Controlled User Interfaces for Cars. Proc. ACM Hum.-Comput. Interact. 2020, 4, 1–23. [Google Scholar] [CrossRef]
  22. Yu, H.; Fan, X.; Zhao, L.; Guo, X. A novel hand gesture recognition method based on 2-channel sEMG. Technol. Health Care 2018, 26, 205–214. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  23. Zhao, L.; Li, S. Object Detection Algorithm Based on Improved YOLOv3. Electronics 2020, 9, 537. [Google Scholar] [CrossRef] [Green Version]
  24. Kulikajevas, A.; Maskeliūnas, R.; Damaševičius, R.; Ho, E.S.L. 3D Object Reconstruction from Imperfect Depth Data Using Extended YOLOv3 Network. Sensors 2020, 20, 2025. [Google Scholar] [CrossRef] [Green Version]
  25. Neto, P.; Pereira, D.; Pires, J.N.; Moreira, A.P. Real-time and continuous hand gesture spotting: An approach based on artificial neural networks. In Proceedings of the 2013 IEEE International Conference on Robotics and Automation, Karlsruhe, Germany, 6–10 May 2013. [Google Scholar] [CrossRef] [Green Version]
  26. Mujahid, A.; Awan, M.J.; Yasin, A.; Mohammed, M.A.; Damaševičius, R.; Maskeliūnas, R.; Abdulkareem, K.H. Real-Time Hand Gesture Recognition Based on Deep Learning YOLOv3 Model. Appl. Sci. 2021, 11, 4164. [Google Scholar] [CrossRef]
  27. Strezoski, G.; Stojanovski, D.; Dimitrovski, I.; Madjarov, G. Hand Gesture Recognition Using Deep Convolutional Neural Networks. In ICT Innovations 2016; Springer International Publishing: Cham, Switzerland, 2017; pp. 49–58. [Google Scholar] [CrossRef]
  28. Gao, X.; Zhang, J.; Wei, Z. Deep learning for sequence pattern recognition. In Proceedings of the 2018 IEEE 15th International Conference on Networking, Sensing and Control (ICNSC), Zhuhai, China, 27–29 March 2018. [Google Scholar] [CrossRef]
  29. Elmezain, M. Hand Gesture Spotting and Recognition Using HMM and CRF in Color Image Sequences. Ph.D. Thesis, Otto-von-Guericke-Universitaet, Magdeburg, Germany, 2010. [Google Scholar]
  30. Elmezain, M.; Al-Hamadi, A.; Niese, R.; Michaelis, B. A Robust Method for Hand Tracking Using Mean-shift Algorithm and Kalman Filter in Stereo Color Image Sequences. World Academy of Science, Engineering and Technology, Open Science Index 35. Int. J. Electron. Commun. Eng. 2009, 35, 2151–2155. [Google Scholar]
  31. Elmezain, M.; Al-Hamadi, A.; Michaelis, B. A Novel System for Automatic Hand Gesture Spotting and Recognition in Stereo Color Image Sequences. J. WSCG 2009, 17, 89–96. [Google Scholar]
  32. Elmezain, M.; Al-Hamadi, A.; Michaelis, B. A Robust Method for Hand Gesture Segmentation and Recognition Using Forward Spotting Scheme in Conditional Random Fields. In Proceedings of the 2010 20th International Conference on Pattern Recognition, Istanbul, Turkey, 23–26 August 2010. [Google Scholar] [CrossRef]
  33. Elmezain, M.; Al-Hamadi, A.; Sadek, S.; Michaelis, B. Robust methods for hand gesture spotting and recognition using Hidden Markov Models and Conditional Random Fields. In Proceedings of the 10th IEEE International Symposium on Signal Processing and Information Technology, Luxor, Egypt, 15–18 December 2010. [Google Scholar] [CrossRef]
Figure 1. The hand postures for the letters (AE) are represented above, while example gestures are shown below.
Figure 2. (a) Source image; (b) depth images with normalization; (c) 3D depth images with normalization; (d) skin detection according to a depth value of up to 10 m for the top image. The bottom image depicts skin pixel detection with no noise (a depth value between 30 cm and 200 cm). The identification of skin pixels is indicated by the yellow color [29].
Figure 3. The cluster trajectories of the gestures for '3' and '5' with respect to their combined features (L_c, L_sc, θ_1, θ_2, θ_3, V).
Figure 4. Deep Neural Network for recognizing hand gestures for numbers ‘0’ to ‘9’.
Figure 5. Key gesture spotting roadmap using HMMs.
Figure 6. Straight-line segmentation with respect to number gestures 0–9: (a) segmentation parts for each gesture reference and (b) representation of gesture for ‘4’ using left–right banding with five segmented lines.
Figure 7. (a) An ergodic model and (b) a simplified ergodic model with fewer transitions and two dummy states.
Figure 8. Non-gesture (garbage) model that includes two dummy states (ST and ET); the dotted arrows show null transitions, and G_{i,j} represents state j of number gesture i.
Figure 9. Network for spotting ten gestures using LRB topology shown with non-gesture model.
Figure 10. Main structure for hand gesture spotting using the DP value.
Figure 11. Block diagram illustrating how sliding windows operate.
Figure 12. (a) Gesture spotting accuracy for various sizes of S_w from 1 to 8; (b) comparison of three types of errors (insertion, deletion, and substitution) according to various sizes of S_w.
Figure 13. The progression of probability over time for the hand gestures “Gesture2”, “Gesture3”, “Gesture6”, and “Non-gesture”.
Figure 14. Average segmentation times for forward and backward spotting methods.
Table 1. Results of meaningful gesture spotting using HMM and recognition using DNN at S_w = 5.
Gesture Path | Train Data | Test Data | I | D | S | Correct | Rec. (%)
'0'          | 60         | 28        | 2 | 1 | 2 | 25      | 89.29
'1'          | 60         | 28        | 0 | 1 | 1 | 26      | 92.86
'2'          | 60         | 28        | 0 | 0 | 1 | 27      | 96.43
'3'          | 60         | 28        | 0 | 0 | 0 | 28      | 100.00
'4'          | 60         | 28        | 0 | 0 | 1 | 27      | 96.43
'5'          | 60         | 28        | 0 | 0 | 1 | 27      | 96.43
'6'          | 60         | 28        | 1 | 1 | 1 | 26      | 92.85
'7'          | 60         | 28        | 0 | 0 | 0 | 28      | 100.00
'8'          | 60         | 28        | 0 | 0 | 1 | 27      | 96.43
'9'          | 60         | 28        | 0 | 1 | 0 | 27      | 96.43
Total        | 600        | 280       | 3 | 4 | 8 | 268     | 95.71
(I = insertion, D = deletion, S = substitution errors in the key gesture spotting outcomes.)
Table 2. Results of meaningful gesture spotting using HMM and recognition using DNN with various S_w ranging from 1 to 8.
S_w | Train Data | Test Data | I  | D  | S  | Rec. (%) | Rel. (%)
1   | 600        | 280       | 10 | 18 | 30 | 82.86    | 80.00
2   | 600        | 280       | 7  | 15 | 28 | 84.64    | 82.85
3   | 600        | 280       | 5  | 7  | 13 | 92.86    | 91.23
4   | 600        | 280       | 3  | 7  | 13 | 92.86    | 91.87
5   | 600        | 280       | 3  | 4  | 8  | 95.71    | 94.70
6   | 600        | 280       | 3  | 7  | 10 | 93.93    | 92.93
7   | 600        | 280       | 4  | 6  | 11 | 93.93    | 92.61
8   | 600        | 280       | 5  | 6  | 12 | 93.57    | 91.93
(I = insertion, D = deletion, S = substitution errors in the key gesture spotting results.)
Table 3. A comparison between our method and our previous work for the same dataset.
Method               | Classifier | Spotting Type | Recognition
Our method           | HMM+DNN    | Forward       | 94.70%
Elmezain et al. [32] | CRF        | Forward       | 90.49%
Elmezain et al. [33] | HMM        | Forward       | 93.91%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
