3D Convolutional Neural Networks for Dynamic Sign Language Recognition

Abstract

Automatic dynamic sign language recognition is even more challenging than gesture recognition because the vocabularies are large and signs are context dependent. Previous works in this direction tend to build classifiers based on complex hand-crafted features computed from the raw inputs. As a type of deep learning model, convolutional neural networks (CNNs) have significantly advanced the accuracy of human gesture classification. However, such methods currently treat video frames as 2D images and recognize gestures at the individual frame level. In this paper, we present a data-driven system in which 3D-CNNs are applied to extract spatial and temporal features from video streams, and the motion information is captured by noting the variation in depth between each pair of consecutive frames. To further boost the performance, multiple modalities of video streams, including infrared, contour and skeleton, are used as input for the architecture, and the prediction results estimated from the different sub-networks are fused together. In order to validate our method, we introduce a new challenging multi-modal dynamic sign language dataset captured with Kinect sensors. We evaluate the proposed approach on the collected dataset and achieve superior performance. Moreover, our method achieves a mean Jaccard Index score of 0.836 on the ChaLearn Looking at People Gesture datasets.

1. INTRODUCTION

As an important bridge across the communication gap between deaf and hearing people, gesture recognition has drawn increasing attention from researchers due to its growing potential in areas such as human–computer interaction. However, there are still numerous challenging problems, especially in video-based applications of dynamic sign language recognition, because it is difficult to capture the information from the hand shape and the spatial–temporal trajectories of the upper limb.
The first problem is to detect and track the moving limbs. Because of environmental factors such as complex backgrounds and varying lighting conditions, gesture segmentation has always been a difficult task. In addition, occlusions may occur during a gesture, and because human limbs are deformable, they can produce a wide variety of complex signs. Second, it is a nontrivial challenge to integrate the temporal and spatial features. Human hands have a great deal of flexibility in the spatial dimension, so instances of the same sign always differ from each other in trajectory, amplitude, direction and position, owing to the diverse expression habits of the performers. Even if a gesture is repeated several times by the same person, the speed and the amplitude of each movement are never exactly equal. Moreover, observing the same gesture from different viewpoints also results in different appearances. Finally, sign language recognition systems based on computer vision involve a large amount of data during video preprocessing, and previous approaches tend to use multi-dimensional parameters to extract features from gestures to ensure better performance. This presents the challenge of detecting and classifying gestures immediately upon or before their completion to provide rapid feedback [1].

Hand-crafted features have dominated object recognition and image classification tasks for many years. Previous studies of gesture recognition focused mainly on adapting hand-crafted features. These approaches typically include two stages: the first is to build a detector that obtains the spatial–temporal features from raw video frames, and the second is to train the classifiers on the obtained features. Different feature detection methods such as Hessian3D [2] and Cuboids [3] have been widely used.
For classifiers, popular methods are conditional random fields [4], hidden Markov models (HMM) [5] and support vector machines (SVM) [6]. It is well known that features are important for the task, since the choice of features is highly problem-dependent. In dynamic sign language recognition especially, different signs may differ dramatically in their appearances and motion patterns. However, feature learning is not a part of the classifiers discussed above and needs to be performed separately to extract features such as edges, gradients, pixel intensities and object shapes.

In recent years, deep learning has made a tremendous impact on pattern recognition, computer vision and other fields. Deep neural networks can extract a hierarchy of features by building high-level features from low-level ones. Such architectures can be trained using either supervised or unsupervised learning, and they reach a level of performance that was previously unattainable on the tasks of image classification [7] and human action recognition [8]. Convolutional neural networks (CNNs) are a type of supervised model in which interleaved trainable filters and local neighborhood pooling operators act on the raw input images, resulting in a hierarchy of progressively complex features. It has been shown that CNNs can be invariant to certain variations such as surrounding clutter, lighting and pose [9]. As a class of effective deep neural network architectures for automated feature construction, CNNs have been applied primarily to 2D images, and such work does not take into account the motion information encoded in multiple contiguous frames [10].
In order to effectively incorporate the motion information into video analysis, we consider the use of 3D-CNNs for dynamic sign language recognition and perform 3D convolutions in the convolutional layers, so that the discriminative features encoded in multiple adjacent frames along both the spatial and temporal dimensions are captured. In fact, applying these models to problems that require understanding the content of continuous frames is still at an early stage. Some recent works [11] have explored applications of this idea. The challenges are the lack of sufficiently large datasets, the high cost of labeling data in many real application scenarios, and the increased computational complexity of modeling the mutual connections among continuous video frames.

Due to the widespread usage of the Microsoft Kinect, there has been an outpouring of relevant methods for human action and gesture recognition. Several new public datasets [12, 13] have made it more convenient for researchers to design novel algorithms that rely on multi-modal data. However, most of the existing public datasets lack effective and accurate labeling or are stored in a single data format. Given the limitations of publicly available datasets for multi-modal dynamic gesture recognition, we present a novel labeled upper-limb dataset of 20 classes to train a deep neural network for the dynamic sign language recognition task. Each class is intended for human–computer interfaces in the real-world environment of museums and was recorded by Kinect sensors. We also present the classification result of our approach on the publicly available ChaLearn 2014 [14] data to evaluate the performance of the gesture recognition, and our model achieves competitive performance.
The major contributions of this paper can be summarized as follows:

We collected a novel Sign Language Video in Museums dataset, which we refer to as SLVM¹. The SLVM contains approximately 7K manually labeled sign performances in continuous video sequences, which belong to 20 categories of sign language commonly used in the real-world environment of museums in China.

A novel 3D CNN for dynamic sign language recognition is proposed. This architecture extracts both the spatial and temporal features from raw video data automatically without any prior knowledge, and thus captures motion information from dynamic gestures. Compared with previous work that trained a 3D CNN for gesture recognition, our model is deeper and iteratively integrates discriminative data representations from multi-modal data. Furthermore, fine-tuning the architecture from a pre-trained model is adopted as an optional technique to prevent over-fitting.

An effective fusion strategy for sub-networks with different input modalities is proposed. It improves the performance of the architecture by compensating for the errors of the separate classifiers, and thus ensures the robustness of the network to missing signals.

The remainder of this work is organized as follows. Section 2 introduces the related works on human action recognition and recent works on CNNs applied to gesture recognition. Section 3 describes the multi-modal data collection and the data preprocessing. In Section 4, the details of our network model are presented. Finally, Section 5 presents the experimental analysis and our final remarks.

2. RELATED WORKS

Gesture recognition has been studied for decades. Traditional approaches to vision-based gesture recognition from video typically include sparse or dense extraction of spatial or spatio-temporal engineered descriptors followed by classification [15]. This section addresses the more general aspects of hand-crafted feature-based methods and deep learning-based methods.

2.1.
Hand-crafted feature-based methods

In the early stage, many hand-crafted features for effective video analysis were introduced in the area of sign language and action recognition. Typically, histograms of oriented gradients or optical flow are computed, and then hidden Markov models, finite-state machines or dynamic time warping are applied as classifiers. Starner et al. [5] introduced an HMM-based system which uses color cameras as sensors to track the unadorned upper limb in real time and interpret American Sign Language with a vocabulary of 40 words. Zaki et al. [16] presented another recognition system for American Sign Language with a lexicon of 30 words. They constructed appearance-based representations and an upper limb tracking system which uses an HMM as the classifier and achieves an accuracy of 89.09% on the RWTH-BOSTON-50 dataset, which includes 50 isolated signs. Ohn-Bar et al. [12] evaluated various hand-crafted features and classifiers for in-car dynamic gesture recognition with RGB-D data. They obtained state-of-the-art performance with a combination of histogram of gradient features and an SVM classifier.

Artificial neural networks are another broadly used classifier for human action recognition tasks. A sign language recognition system based on 3D Hopfield neural networks is presented in [17]; it covered a vocabulary of 15 different hand gestures and achieved a 91% recognition rate. Kouichi et al. [18] proposed a neural network to recognize a finger alphabet of 42 symbols. Kim et al. [19] took 3D coordinates and angles of the hands as input and used a fuzzy min–max neural network for hand gesture recognition. This method reached an accuracy of 85% on 25 isolated gestures. However, hand-crafted feature-based methods cannot take all factors into consideration at the same time, and most of these approaches either employ pre-segmented video sequences or treat detection and classification as separate problems.

2.2.
Deep learning-based methods

In contrast to the traditional approaches that rely on the construction of hand-crafted features, there is a growing trend toward feature representations learned by deep neural networks. Karpathy et al. [20] proposed a CNN-based model that classifies large-scale video while operating at two spatial resolutions. Zisserman et al. [11] increased the amount of training data by multi-modal learning and extracted spatial and temporal features concurrently with two-stream CNNs. The key to these approaches is learning spatial and temporal features consecutively.

3D convolution has been applied to video stream analysis and classification in recent years. Tran et al. [21] proposed a 3D-CNN that analyzes a series of short video clips and averages the network's responses over all clips. Molchanov et al. [1] applied a 3D-CNN to the whole video sequence and introduced space–time video augmentation techniques to avoid over-fitting, which motivated our work. The proposed 3D-CNN is based on the spatio-temporal features extracted from video data. It achieved significant performance improvements compared with hand-crafted feature baselines and also obtained promising accuracy on large-scale datasets, and many recent approaches build on it, such as [22, 23]. Moreover, two-stream frameworks have obtained remarkable performance for human action recognition. Duan et al. [24] combined a convolutional two-stream consensus voting network and a 3D ConvNet for dynamic gesture recognition. Their method obtained state-of-the-art performance on the RGB-D HuDaAct datasets.

Another obvious approach is multi-modal fusion. Neverova et al. [25] proposed combining color images of the hand regions with a pose descriptor consisting of seven subsets of upper-body skeleton features to recognize gestures. Wudi et al.
[26] presented a system for multi-modal gestures, which consists of two deep neural networks: a CNN to manage and fuse batches of depth and RGB data, and a Gaussian–Bernoulli Deep Belief Network (DBN) to handle skeletal dynamics. In the follow-up work [27], they demonstrated that multi-modal fusion of the different inputs results in a clear improvement over unimodal approaches, owing to the complementary nature of the different input modalities.

In order to take full advantage of CNNs, we propose a multi-scale 3D-CNN that integrates multi-modal visual data for sign language detection and recognition. Each type of data stream provides several adjacent frames as input. Our main innovation is a new training strategy that carefully initializes the sub-components of the individual modalities and then fine-tunes the model from the pre-trained model to prevent over-fitting. Moreover, we fuse the individual results to improve the performance of the proposed model. In the experiments, we validate that our proposed deep network yields a clear performance improvement compared with unimodal approaches.

3. DATA COLLECTION AND PREPROCESSING

3.1. Data collection with Kinect

Table 1. Evaluation of different approaches on the SLVM dataset.

Training model          Accuracy (%)    Improvement (%)
Contour
  Tanh units            79.1            -
  +ReLUs                83.5            5.43
  +LCN                  85.4            2.39
  +Dropout              87.6            2.57
Infrared
  Tanh units            79.9            -
  +ReLUs                84.3            5.50
  +LCN                  86.2            2.25
  +Dropout              88.3            2.43
Multi-modal Fusion      89.2            1.01

In this section, we describe how we used Kinect sensors to record the video samples of the dynamic sign language dataset. We developed a recording system based on the Microsoft Kinect for Windows SDK V2 to capture multi-modal data, which we call 'Sign Language Recorder'². For each gesture we obtained infrared and contour image sequences; the source video was recorded at 30 FPS and stored at a resolution of 512*424 in MP4 format. In addition, skeletal data also contribute to the network's prediction by tracking the trajectory of the hands, which is crucial for discriminating between sign language classes performed in similar body poses. The multi-modal data include 6800 samples, which constitute a complete vocabulary of 20 Chinese sign language categories performed by 17 different users, and each gesture in the dataset is accompanied by a ground truth label as well as information about its start and end frames.

The storage structure of the multi-modal data is described in Fig. 1. Taking contour data as an example: the camera views the physical space in front of the Kinect and generates a contour image of the human in the plane facing the sensor; this contour image is then used for background removal, followed by generation of the depth profile of the human body.
In order to improve the computational efficiency and ensure real-time communication between human and computer, we abandoned the traditional methods [20] of using higher-dimensional data and instead used infrared and contour data, each consisting of a single channel, to train the respective deep neural networks. In addition, skeleton data are used to precisely track the trajectory of the hands.

Figure 1. Overview of our method on an example from the 2017 Sign Language Video in Museums (SLVM) dataset.

3.2. Sign language temporal segmentation

Each sign language word in the SLVM dataset has a different duration. In order to normalize the temporal lengths, we first re-sampled each sign language word to 32 frames, using a sliding-window gesture detector and dropping or repeating frames. Dynamic sign language generally consists of three temporally overlapping phases (P1, P2, P3): preparation, nucleus and retraction [22]. The preparation and retraction phases, regarded as the beginning and ending actions, can be quite similar for different gestures and are hence deemed less useful or even detrimental to accuracy. As illustrated in Fig. 2, for the sign 'OK', the gesture sequence can essentially be divided into P1: preparing to lift the arm to the specified level, P2: the motion of the OK gesture, and P3: slowly lowering the arm.

Figure 2. Sliding windows for gesture temporal segmentation.

More precisely, since our goal is to capture the variation during the demonstration of a sign, we train a binary classifier which relies principally on the nucleus phase to distinguish between stationary and active periods.
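As a minimal sketch of the re-sampling step described above, the function below picks 32 frame indices from a clip of arbitrary length, skipping frames in long clips and repeating frames in short ones. The helper name `resample_indices` and the even-spacing rule are assumptions made for illustration; the text only states that frames are dropped or repeated to reach 32 frames.

```python
from typing import List


def resample_indices(num_frames: int, target: int = 32) -> List[int]:
    """Return `target` frame indices for a clip of `num_frames` frames.

    Hypothetical helper: indices are spread evenly over the clip, so
    clips longer than `target` drop frames and shorter clips repeat them.
    """
    if num_frames <= 0:
        raise ValueError("clip must contain at least one frame")
    # Integer arithmetic keeps every index inside [0, num_frames - 1].
    return [(i * num_frames) // target for i in range(target)]


if __name__ == "__main__":
    print(resample_indices(64)[:4])  # long clip: every second frame is kept
    print(resample_indices(16)[:4])  # short clip: each frame is repeated
```

Even spacing keeps the first frame of every clip and preserves temporal order, which is all the later 32*64*64 input format requires.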
The classifier is a two-layer fully connected neural network taking the articulated pose descriptor as input. All training frames having a sign language word label are used as positive examples, while frames that lie before and after such labeled segments are considered negatives. Each frame is thus annotated with a label 'static' or 'dynamic'. Using this classifier, we can predict the beginning and ending frames of every sign in the video samples as described in Algorithm 1.

Algorithm 1 The approach for Sign Language Temporal Segmentation
Data: input
  Gxs: the original starting frame number of gesture x
  Gxe: the original ending frame number of gesture x
  Lx = Gxe − Gxs: the length of gesture x
1.  for x1→x20 do
2.    if Lx > 32 then
3.      Gxns = Gxs + (Lx − 32)/2, where Gxns is the new starting frame number of gesture x
4.      Gxne = Gxns + 32, where Gxne is the new ending frame number of gesture x
5.    else (Lx <= 32)
6.      Gxs is the new starting frame number of gesture x
7.      Gxne = Gxs + 32, where Gxne is the new ending frame number of gesture x
Result: output

3.3. Preprocessing

Training directly on the raw videos captured by the Kinect, which are sequences of 512*424 pixel images, would be computationally expensive. Therefore, before feeding the data to the deep neural network, we preprocess all videos by first cropping the upper body and upper limb using the skeleton information captured in the dataset. Furthermore, background removal using median filtering and noise reduction with a threshold method [28] were applied to the cropped data. Contour frames are normalized to zero mean, and infrared frames are only normalized by the image standard deviation. As shown in Fig. 3, the final inputs to the sign language classifier contain four video samples: upper body with infrared frames, upper limb with infrared frames, upper body with contour frames, and upper limb with contour frames. All of the frame sources are of size 32*64*64 (32 frames in the temporal dimension, and 64*64 is the spatial resolution of the images).

Figure 3. Overview of our 3D-CNN classifier for sign language recognition.

4.
MODELS

In this section, we present our architecture of 3D convolutional neural networks for the multi-modal dynamic sign language detection and recognition task. Dropout and regularization techniques are used to reduce over-fitting.

4.1. 3D convolution

Convolutional neural networks are biologically inspired variants of the multi-layer perceptron. The idea is that a filter window, called the local receptive field, moves over the units of the previous layer. In 2D CNNs, each unit in the convolutional layer receives inputs from the set of units located in the filter window and is calculated by the following equation:

$$\Upsilon_{xy} = f\Big(b + \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} w_{ij}\, v_{(x+i)(y+j)}\Big) \qquad (1)$$

where $f(\cdot)$ is the neural activation function such as Sigmoid, Tanh or ReLU, $\Upsilon_{xy}$ is a unit in the feature map at position $(x, y)$, $v_{(x+i)(y+j)}$ is the input unit at position $(x+i, y+j)$, $w_{ij}$ is an $(n \times n)$ array of shared weights and $b$ is the shared bias of the feature map.

In a 2D CNN architecture, convolutions and pooling are applied to the 2D feature maps to calculate features from the spatial dimensions only. When applied to sign language recognition problems based on video analysis, it is also desirable to capture the motion information encoded in multiple consecutive frames along with the spatial features. For this reason, we propose to perform 3D operations in the convolution and pooling stages so that discriminative features from both spatial and temporal dimensions are captured. The 3D convolution is achieved by convolving a 3D kernel with the temporal cube formed by stacking multiple contiguous frames together. With this structure, feature maps in the convolution layer are connected to multiple successive frames in the previous layer, thereby capturing motion information.
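To make the stacking idea concrete, the following is a minimal pure-Python sketch of a single 3D kernel sliding over a stack of frames (valid positions only, stride 1). The function name `conv3d_valid`, the ReLU default and the toy temporal-difference kernel are illustrative assumptions, not the implementation used in this work.

```python
def conv3d_valid(volume, kernel, bias=0.0, activation=lambda z: max(z, 0.0)):
    """Slide one 3D kernel over a frame stack.

    volume[t][y][x] is a stack of T frames of size H x W;
    kernel[k][j][i] spans m frames and an n_h x n_w spatial window.
    Returns the feature map out[t][y][x] for all valid positions.
    """
    T, H, W = len(volume), len(volume[0]), len(volume[0][0])
    m, n_h, n_w = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - m + 1):
        plane = []
        for y in range(H - n_h + 1):
            row = []
            for x in range(W - n_w + 1):
                s = bias
                # Weighted sum over the spatial window and m adjacent frames.
                for k in range(m):
                    for j in range(n_h):
                        for i in range(n_w):
                            s += kernel[k][j][i] * volume[t + k][y + j][x + i]
                row.append(activation(s))
            plane.append(row)
        out.append(plane)
    return out


if __name__ == "__main__":
    # Toy clip: 4 frames of 4x4 pixels whose brightness grows by 1 per frame.
    moving = [[[float(t)] * 4 for _ in range(4)] for t in range(4)]
    # Kernel that subtracts one frame from the next (a crude motion detector).
    diff = [[[-1.0, -1.0], [-1.0, -1.0]], [[1.0, 1.0], [1.0, 1.0]]]
    print(conv3d_valid(moving, diff)[0][0][0])  # 4.0: the change between frames
```

Because the kernel spans two frames, its response depends on how the pixels change over time; the same kernel applied to a clip of identical frames responds with zero everywhere, which a purely spatial 2D kernel could not distinguish.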
Similar to the 2D convolution operator, the 3D convolution is calculated by the following equation:

$$\Upsilon_{xyt} = f\Big(b + \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} \sum_{k=0}^{m-1} w_{ijk}\, v_{(x+i)(y+j)(t+k)}\Big) \qquad (2)$$

where $w_{ijk}$ is the $(i, j, k)$th value of the kernel connected to the feature map in the previous layer and $m$ is the size of the 3D kernel along the temporal dimension. The motion information in multiple contiguous frames is captured by this setting. Since the weights of the kernel are replicated throughout the entire cube, a single 3D convolution kernel can only extract one type of feature from the frame cube. A fundamental principle of CNN architecture design is that the number of feature maps should increase in later layers, so that multiple types of features can be generated from the same set of lower-level feature maps. Similar to the 2D convolution case, this can be achieved by applying multiple 3D convolutions with distinct kernels to the same location in the previous layer.

4.2. Our 3D-CNN architecture for sign language recognition

Inspired by the framework successfully applied to human action recognition [8], CNN architectures suitable for the sign language recognition task can be devised based on the design principle of the 3D convolution described above. Our proposed model is a data-driven learning system that consists of two sub-networks: an Infrared-classifier network (I-CN) with parameters $\Theta_I$ and a Contour-classifier network (C-CN) with parameters $\Theta_C$. On top of that, the dimensionality reduction methods and complicated preprocessing are naturally embedded in the framework.

We use the Microsoft Kinect to collect the SLVM dataset. The dataset includes an infrared stream, a contour stream and body skeleton data, which were used simultaneously to track the trajectory of the limbs. Unlike traditional approaches that use color images consisting of three RGB channels, the infrared and contour images contain only a single channel. The complete architecture is depicted in Fig.
3, for each type of visual source we consider 32 frames of size 64*64 centered on the current frame as input to the 3D-CNN. This results in four feature maps denoted by infrared-Body, contour-Body, infrared-Hands, contour-Hands. Each feature map contains 32 stacked frames from the corresponding channel as a cube in temporal dimension. Multiple source inputs usually lead to better performance compared to single format data source. Our deep neural network consists of 11 layers including the input layer. After the input layer, the next convolution layer consists of 16 feature maps produced by 5*5*5 spatial and temporal 3D convolutional kernels, followed by local contrast normalization (LCN) [29]. A 3D max sub-sampling with strides (2,2,2) is then applied. The second convolution layer used 32 feature maps with 5*5*5 3D kernels followed by 3D sub-sampling with strides (2,2,2). The third convolution layer is composed of 48 feature maps with 4*4*4 3D kernels followed by 3D max sub-sampling with strides (2,2,2). There were further followed by three fully connected softmax classifier with 512 neurons in their hidden layer, which produced class-membership probabilities P(C∣xI,ΘI) and P(C∣xC,ΘC) for the 20 sign language classes. Rectified linear unit (ReLUs) activation functions were used to speed up training in the architecture except for softmax layers. We computed the output of the softmax layers as   P(C∣x,Θ)=softmax(xk)=exp(xk)∑i=1kexp(xi) (3)where xi was the output of the neuron i. Algorithm 2 CNN-based classifier for sign language recognition Data: input  D1: I1={I1j}j∈[1…t]—raw input of infrared sequence in the form of 32*64*64.    Where 32 frames in temporal dimension and 64*64 is the resolution of images in spatial dimension.  D2: C1={C1j}j∈[1…t]—raw input of contour sequence in the form of 32*64*64.    Where 32 frames in temporal dimension and 64*64 is the resolution of images in spatial dimension.  D3: S1={S1j}j∈[1…t]—raw input of skeletal feature sequence.  
Label: Y={Yj}j∈[1…t]—frame based local label.      1.  for n→0 to 1 do  2.    if n=0 then  3.     Pre-training the Sub-networks for Infrared using data D1, and Supervised fine-tune the 3D-CNN using label Y by standard mini-batch stochastic gradient descent back-propagation.  4.     Produce class-membership probabilities P(C∣xI,ΘI) from Equation (3).  5.    else  6.     Pre-training the Sub-networks for Contour using data D2, and Supervised fine-tune the 3D-CNN using label Y by standard mini-batch stochastic gradient descent back-propagation.  7.     Produce class-membership probabilities P(C∣xC,ΘC) from Equation (3).  8.  Fuse the two networks by feed their respective class-membership probabilities to Equation (7).      Result: output  Data: input  D1: I1={I1j}j∈[1…t]—raw input of infrared sequence in the form of 32*64*64.    Where 32 frames in temporal dimension and 64*64 is the resolution of images in spatial dimension.  D2: C1={C1j}j∈[1…t]—raw input of contour sequence in the form of 32*64*64.    Where 32 frames in temporal dimension and 64*64 is the resolution of images in spatial dimension.  D3: S1={S1j}j∈[1…t]—raw input of skeletal feature sequence.  Label: Y={Yj}j∈[1…t]—frame based local label.      1.  for n→0 to 1 do  2.    if n=0 then  3.     Pre-training the Sub-networks for Infrared using data D1, and Supervised fine-tune the 3D-CNN using label Y by standard mini-batch stochastic gradient descent back-propagation.  4.     Produce class-membership probabilities P(C∣xI,ΘI) from Equation (3).  5.    else  6.     Pre-training the Sub-networks for Contour using data D2, and Supervised fine-tune the 3D-CNN using label Y by standard mini-batch stochastic gradient descent back-propagation.  7.     Produce class-membership probabilities P(C∣xC,ΘC) from Equation (3).  8.  Fuse the two networks by feed their respective class-membership probabilities to Equation (7).      
Result: output
Algorithm 2 CNN-based classifier for sign language recognition

4.3. Initialization
We initialized the weights of the fully connected hidden layers and the softmax layer with random samples from a normal distribution with $\mu=0$ and $\sigma=0.03$. To ensure a non-zero partial derivative during training, the biases of all layers except the softmax layer were given a fixed initial value of 1; the biases of the softmax layer were set to 0. The weights of the 3D convolution layers were randomly initialized from a uniform distribution over $[-W_b, W_b]$, where $W_b=\sqrt{6/(n_i+n_o)}$ and $n_i$, $n_o$ are the numbers of neurons in the input and output layers, respectively.

4.4. The process of optimization
Training the CNN architecture involves optimizing the network parameters $\Theta$ to minimize a cost function over the dataset $D$, with fine-tuning by the back-propagation algorithm [9]. We selected the negative log-likelihood as the cost function:
$$L(\Theta,D)=-\frac{1}{|D|}\sum_{i=0}^{|D|}\log P\left(C^{(i)}\mid x^{(i)},\Theta\right) \tag{4}$$
We performed the optimization by stochastic gradient descent with mini-batches of size 16 for both the I-CN and the C-CN.
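As an illustration of Sections 4.3 and 4.4, the initialization rules and the cost of Equation (4) can be sketched in plain Python. This is a minimal sketch, not the authors' Theano implementation: all function names are hypothetical, the bound is assumed to take the Glorot-style form $W_b=\sqrt{6/(n_i+n_o)}$, and the cost is averaged over the samples actually supplied.

```python
import math
import random


def uniform_bound(n_in, n_out):
    """Half-width W_b of the uniform interval (assumed form: sqrt(6 / (n_i + n_o)))."""
    return math.sqrt(6.0 / (n_in + n_out))


def init_conv_weights(n_in, n_out, count, rng):
    """Draw `count` 3D-convolution weights uniformly from [-W_b, W_b]."""
    w_b = uniform_bound(n_in, n_out)
    return [rng.uniform(-w_b, w_b) for _ in range(count)]


def init_dense_weights(count, rng, sigma=0.03):
    """Draw fully connected / softmax weights from a normal distribution (mu=0, sigma)."""
    return [rng.gauss(0.0, sigma) for _ in range(count)]


def init_biases(count, is_softmax):
    """Biases start at 1 for every layer except the softmax layer, which starts at 0."""
    return [0.0 if is_softmax else 1.0] * count


def negative_log_likelihood(probs, labels):
    """Mean negative log-likelihood; probs[i][c] is P(C = c | x_i, Theta)."""
    total = sum(math.log(p[y]) for p, y in zip(probs, labels))
    return -total / len(labels)


if __name__ == "__main__":
    rng = random.Random(0)
    conv_w = init_conv_weights(n_in=27, n_out=27, count=5, rng=rng)
    print(all(abs(w) <= uniform_bound(27, 27) for w in conv_w))
    # A mini-batch of two samples with softmax outputs over three classes.
    batch_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
    print(negative_log_likelihood(batch_probs, labels=[0, 1]))
```

In a real training loop the cost would be evaluated on each mini-batch of 16 samples and differentiated by the framework; the sketch only shows what quantity is being minimized.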
Nesterov's accelerated gradient descent (NAG) [30] with a fixed momentum coefficient of 0.9 is also used to update the network parameters $\theta\in\Theta$ at every iteration $t$:
$$\begin{aligned}\nabla f(\theta_t)&=\left\langle\frac{\delta L}{\delta(\theta_{t-1})}\right\rangle_{\text{batch}}\\ m_{t+1}&=\kappa m_t-\varepsilon\nabla f(\theta_t+\kappa m_t)\\ \theta_{t+1}&=\theta_t+m_{t+1}\end{aligned} \tag{5}$$
where $\nabla f(\theta_t)$ is the gradient of the negative log-likelihood cost function with respect to the parameter $\theta_t$ averaged across the mini-batch, $m_t$ is the momentum, $\kappa$ is the momentum coefficient and $\varepsilon$ is the learning rate, which is initialized at 0.07 and reduced to 1/10 of its value every 6500 iterations if the cost function has not improved by more than 10% [31]. Weight decay usually led to better generalization for sign language from different subjects. We observed that NAG converged faster than stochastic gradient descent with only momentum in the experiments described in Section 5.

4.5. Regularization and dropout
To reduce over-fitting, we applied regularization, which is regarded as a key component of successful generalization of the network. During training, the joint learning objective can be formulated as follows:
$$L=L_0+\frac{\lambda}{2n}\sum_{w}w^2 \tag{6}$$
where the first term is the negative log-likelihood of Equation (4) and the second term is the L2 regularization. Optimizing the joint learning objective is a compromise between minimizing the original cost function and finding small weights, with the balance set by the value of $\lambda$. We also used the dropout technique in our model, which has been very successful in improving the performance of deep neural networks [32]. During dropout, the outputs of the fully connected hidden layer were randomly set to 0 with probability $\rho=0.5$, so about half of the neurons are left out of the back-propagation step of that training iteration; by repeating this process the network learns a set of weights and biases. To compensate, the weights of the layer following the dropped layer were multiplied by 2 in the forward propagation stage.

4.6. Multi-modal fusion
CNNs have been shown to be more effective when combining data from different modalities [33]. Assuming that the weights of the two sub-networks are pre-trained, the next important issue is to determine a fusion strategy. Our fusion model linearly combines the class-membership probabilities estimated by the separate single-modality networks to compute the final probabilities for the sign language classifier:
$$P(C\mid x)\propto\alpha\,P(C\mid x_I,\Theta_I)+(1-\alpha)\,P(C\mid x_C,\Theta_C) \tag{7}$$
Here, the probabilities are provided by the sub-networks described in Section 4.2, and $\alpha$ is the coefficient that controls the contribution of each stream; its value is optimized through cross-validation. Usually, the best value of $\alpha$ is very close to 0.5, indicating that the two networks with different input modalities are equally important. Following this strategy, we combine the two results at the output layer of our model and finally predict the class label as $c^*=\arg\max_C P(C\mid x)$.

4.7. GPU acceleration
A potential concern in training deep neural networks is that it is time consuming: training a CNN with millions of parameters on a large-scale video dataset usually takes several weeks or months. Fortunately, GPU-based computation with CUDA-Convnet for parallel processing makes it possible to achieve real-time efficiency. We train our sign language classifier with the Python libraries Theano [34] and PyLearn2 [35] for fast implementation. Experiments were conducted on one modern machine with an eight-core processor (Intel Core i7-6700K), 32GB SDRAM and an NVIDIA GeForce GTX 1070 GPU with 11264MB of memory. With the GPU, training our 3D-CNN model was three to five times faster than before; the training time per epoch is about 2100 s, which allows us to complete the training in 3 days.

5. EXPERIMENT AND RESULTS
This section reports how our strategies perform through several groups of experiments.
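Before turning to the datasets, the late-fusion rule of Equation (7) in Section 4.6, which the experiments below rely on, can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the two inputs are taken to be the softmax outputs of the infrared and contour sub-networks, and the fused vector is renormalized because Equation (7) is stated only up to proportionality.

```python
def fuse_probabilities(p_infrared, p_contour, alpha=0.5):
    """Linear late fusion: alpha * P(C | x_I) + (1 - alpha) * P(C | x_C), renormalized."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(p_infrared) != len(p_contour):
        raise ValueError("both streams must score the same set of classes")
    fused = [alpha * pi + (1.0 - alpha) * pc
             for pi, pc in zip(p_infrared, p_contour)]
    total = sum(fused)
    if total <= 0.0:
        raise ValueError("fused scores must have a positive sum")
    return [f / total for f in fused]


def predict_class(p_infrared, p_contour, alpha=0.5):
    """Return c* = argmax_C P(C | x) for the fused distribution."""
    fused = fuse_probabilities(p_infrared, p_contour, alpha)
    return max(range(len(fused)), key=fused.__getitem__)


if __name__ == "__main__":
    infrared = [0.7, 0.2, 0.1]   # hypothetical softmax output of the infrared stream
    contour = [0.4, 0.5, 0.1]    # hypothetical softmax output of the contour stream
    print(fuse_probabilities(infrared, contour, alpha=0.5))
    print(predict_class(infrared, contour, alpha=0.5))
```

Choosing alpha by cross-validation then amounts to evaluating `predict_class` on held-out data for a grid of alpha values and keeping the best one.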
First, the SLVM and ChaLearn Looking at People Gesture datasets used in our experiments are introduced briefly; then we present the experimental protocol we followed and a comparison with previous methods.

5.1. Experiments on SLVM dataset
As mentioned in Section 3.1, most existing datasets lack effective and accurate labeling or are stored in a single data format. Given these limitations, we built a publicly available sign language dataset ourselves. SLVM contains 20 vocabulary items that are widely used during a visit to the museum. Each sign is performed by 17 different users and every signer repeated each word 20 times; thus each gesture has 340 samples and there are 6800 samples in total. The infrared image and the contour image were recorded simultaneously with a Kinect sensor. The experiments used a 7-fold cross-validation approach to evaluate the various settings of the proposed framework. All samples are divided into three mutually exclusive subsets (5100 training, 850 validation and 850 test instances), and no sample performed by the same person appears in both the training and testing subsets. In the training phase, the training and validation subsets were used. To maximize the generalization performance of the framework, some form of model selection [36] is required to pick optimal parameters based on the accuracy rate on the validation set. The hyper-parameters are $\vartheta=(\rho,\lambda,\varepsilon)$, where $\rho$ is the dropout probability, $\lambda$ is the regularization parameter controlling the bias-variance trade-off described in Section 4.5 and $\varepsilon$ is the learning rate. We train the proposed deep architecture from scratch; therefore, higher learning rates are needed [1]. The initial learning rate $\varepsilon$ is set to 0.07 and reduced to 1/10 of its value every 6500 iterations if the cost function has not improved by more than 1/10.
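The learning-rate rule just described, together with the NAG update of Equation (5) in Section 4.4, can be sketched in plain Python. This is only an illustrative sketch: the function names are hypothetical, parameters are plain lists of floats, the relative-improvement test is one possible reading of "did not improve by more than 1/10", and a toy quadratic cost stands in for the network's cost function.

```python
def nag_step(theta, momentum, grad_fn, lr, kappa=0.9):
    """One Nesterov update: m' = kappa*m - lr*grad(theta + kappa*m); theta' = theta + m'."""
    lookahead = [t + kappa * m for t, m in zip(theta, momentum)]
    grad = grad_fn(lookahead)
    new_momentum = [kappa * m - lr * g for m, g in zip(momentum, grad)]
    new_theta = [t + m for t, m in zip(theta, new_momentum)]
    return new_theta, new_momentum


def maybe_drop_lr(lr, previous_cost, current_cost,
                  min_rel_improvement=0.10, factor=0.1):
    """Scale lr by `factor` unless the (positive) cost improved by more than 10%.

    Intended to be called once per checking period, e.g. every 6500 iterations.
    """
    improvement = (previous_cost - current_cost) / previous_cost
    return lr * factor if improvement <= min_rel_improvement else lr


if __name__ == "__main__":
    # Toy cost f(theta) = 0.5 * theta^2, whose gradient is theta itself.
    def grad_fn(params):
        return list(params)

    theta, momentum, lr = [1.0], [0.0], 0.07
    for _ in range(200):
        theta, momentum = nag_step(theta, momentum, grad_fn, lr)
    print(theta[0])
    print(maybe_drop_lr(lr, previous_cost=1.0, current_cost=0.95))
```

The gradient is evaluated at the look-ahead point `theta + kappa * momentum`, which is what distinguishes NAG from stochastic gradient descent with plain momentum.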
We then use the remaining samples as the testing data and calculate the accuracy as the final fitness value; these sequences were held out for testing while all other sequences were used for training and validation. The classification accuracies of the training process are shown in Fig. 4: 98 epochs are sufficient to reach the lowest test error, and the best training and prediction accuracies are 95.8% and 89.2%, respectively.

Figure 4. Classification accuracies of our method on the SLVM dataset.

Our most notable experiments are the models with ReLUs and LCN. Furthermore, dropout is used as the main approach to reducing over-fitting. As shown in Table 1, neural networks with ReLU non-linearities improve accuracy by 5.5% with respect to Tanh units, possibly because the ReLU networks proved faster to train than standard Tanh units. LCN, added in the first two layers of the proposed deep architecture, also yields more than 2% improvement for single modalities. Dropout is a technique that provides a way of approximately combining exponentially many different networks efficiently and prevents over-fitting [32]. The comparisons show that it typically improves the performance of the proposed deep architectures by at least 2.4%, even when no data augmentation is used. Miao et al. [23] use three different modalities and combine the features extracted from them to boost the final performance to a large extent. This research motivated us to employ a fusion scheme for a higher recognition rate. As can be observed from both the performance measures and the accuracy rate, the infrared data alone (accuracy = 88.3%) usually perform better than the contour data alone (accuracy = 87.6%).
This result makes sense because the information in the infrared data is more robust to changes in indoor lighting and makes it easier to exclude background noise. In contrast, the contour images tend to ignore details of the hand-shape. Another interesting conclusion from Table 1 is that using the fusion scheme described in Section 4.6 to combine the infrared and contour modalities usually performs better than using infrared video alone (the accuracy improves to 89.2%). One possible explanation is that some sign language words have similar trajectories, while the details of the hand-shape and the motion ranges of the upper limb are quite different. Therefore, combining both features is a better choice. To better understand the performance of the model, we show the confusion matrix in Fig. 5, which indicates which words caused confusion and displays the fraction of true positives for each sign language word on the diagonal. We observed that almost all signs are confused with others to some degree, although some signs are easier to recognize than others. Take ‘follow me’ as an example: it is expressed with two hands, and the motion trajectory of the arms is very different from that of many other gestures.

Figure 5. Confusion matrices for our method on the SLVM dataset.

At the same time, because the proposed 3D-CNN extracts features from 32 consecutive frames, its prediction will be uncertain if the model cannot recognize the class of an input, since some gestures are difficult to distinguish from others when the trajectories of the upper limb overlap.

5.2. Experiments on ChaLearn dataset
To validate our proposed sign language recognition algorithm, we evaluated it on the benchmark ChaLearn LAP dataset [13], which was made public for the gesture classification challenge in 2014.
The focus is on user-independent spotting of gestures from multiple samples, which means learning to detect and recognize gestures from multi-modal instances of each category performed by different people. As illustrated in Table 2, this dataset contains 13 858 RGB-D videos of 20 upper-body Italian conversational sign language gestures performed by 27 different users with variations in surroundings, clothing and lighting; the videos were recorded with a consumer RGB-D sensor.

Table 2. ChaLearn 2014 Gesture Dataset Statistics.
Subsets      Labeled instances   Length (min)
Training     7754                470
Validation   3362                230
Testing      2742                240

Several measures can be used to evaluate the performance of architectures. In this work, we follow the ChaLearn LAP 2014 Challenge score to measure the gesture recognition performance. The competition score is based on the Jaccard Index, which is defined as follows:
$$J(s,n)=\frac{|A(s,n)\cap B(s,n)|}{|A(s,n)\cup B(s,n)|} \tag{8}$$
where $A(s,n)$ is the ground truth label of gesture $n$ in sequence $s$, and $B(s,n)$ is the algorithm's prediction for that gesture in sequence $s$. The Jaccard Index $J(s,n)$ can be seen as the overlap rate between $A(s,n)$ and $B(s,n)$. To compute the final score, the Jaccard Index is averaged over all gesture classes and all sequences:
$$J_{\text{mean}}=\frac{1}{NS}\sum_{s=1}^{S}\sum_{n=1}^{N}J(s,n) \tag{9}$$
where $N=20$ is the number of categories and $S$ is the number of sequences in the test set. We use this mean Jaccard Index as the final evaluation criterion. Following the algorithm proposed above, we use the RGB map and the depth map to train the respective sub-networks.
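Equations (8) and (9) can be made concrete with a short sketch. It assumes that $A(s,n)$ and $B(s,n)$ are represented as collections of frame indices in which gesture $n$ is active in sequence $s$; the function names are hypothetical, and scoring an empty union as 0 is an assumption, since the text does not define that case.

```python
def jaccard_index(truth_frames, predicted_frames):
    """Equation (8): |A ∩ B| / |A ∪ B| for two collections of frame indices."""
    a, b = set(truth_frames), set(predicted_frames)
    union = a | b
    if not union:
        return 0.0  # assumption: gesture absent from both truth and prediction scores 0
    return len(a & b) / len(union)


def mean_jaccard(truth, predicted, num_classes):
    """Equation (9): average J(s, n) over all S sequences and N gesture classes.

    truth[s] and predicted[s] map a class index n to the frames labelled n.
    """
    total = 0.0
    for truth_s, pred_s in zip(truth, predicted):
        for n in range(num_classes):
            total += jaccard_index(truth_s.get(n, ()), pred_s.get(n, ()))
    return total / (num_classes * len(truth))


if __name__ == "__main__":
    # One hypothetical sequence with two gesture classes.
    truth = [{0: range(10, 20), 1: range(30, 40)}]
    predicted = [{0: range(15, 25), 1: range(30, 40)}]
    print(jaccard_index(truth[0][0], predicted[0][0]))
    print(mean_jaccard(truth, predicted, num_classes=2))
```

In the example, class 0 overlaps on 5 of 15 frames and class 1 is predicted exactly, so the two per-class scores are 1/3 and 1.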
Two different training strategies are used to evaluate the proposed architecture:

Strategy 1: We first train the networks on the ChaLearn 2014 dataset from scratch. The learning rate is initialized at 0.05 and reduced to 1/10 of its value every 9700 iterations [1]. Rectified Linear Units, dropout and regularization were used to improve the performance of our network. The final accuracy on the test set is 92.4% and the mean Jaccard Index score is 0.793, meaning that our method can accurately classify different gestures of dynamic sign language.

Strategy 2: We fine-tune the networks for the ChaLearn 2014 dataset starting from the models pre-trained on SLVM. The learning rate is initialized at 0.01 and reduced by a factor of 10 after every 9700 iterations. As illustrated in Fig. 6, fine-tuning the networks from pre-trained models has a significant impact on generalization. The confusion matrix for the ChaLearn 2014 dataset reveals that recognition rates across gestures generally stay at a mean accuracy of 94.5±1.77%, although a single gesture, Buonissimo, falls below a 90% recognition rate. We observe a validation accuracy of 96.3% for our best model and improve the mean Jaccard Index to 0.836. The results of the experiment show that fine-tuning pre-trained models achieves better recognition rates than training from scratch. Thus, multi-modality can also be viewed as an effective data-augmentation method to prevent over-fitting on relatively small datasets (Fig. 7).

Figure 6. Classification accuracies of our method on the ChaLearn 2014 dataset.

Figure 7. Confusion matrices for our method on the ChaLearn 2014 dataset.
A comparison of our results with previous work from the challenge is presented in Table 3. The purpose of this comparison is to explore the relative strengths and weaknesses of different learning representations as well as the nuances of multi-modal fusion. The first three works [37–39] in the table use hand-crafted feature representations that are subsequently classified. The results are consistent with the view that deep neural networks can extract high-level features from raw video streams by building hierarchical architectures.

Table 3. Comparison of results on the ChaLearn Gesture dataset.
Method                             Modalities used       Jaccard
Random Forest [37]                 RGB                   0.746
MRF [38]                           Skeleton+RGB          0.826
Boosted classifier [39]            Skeleton+Depth+RGB    0.833
3D CNN [28]                        RGB-D                 0.791
Dynamic DNN [27]                   Skeleton              0.863
                                   RGB-D                 0.787
                                   Multi-modal Fusion    0.879
DNN [25]                           Depth+RGB+Audio       0.881
Proposed: Trained from scratch     RGB+Skeleton          0.793
                                   Depth+Skeleton        0.807
                                   Multi-modal Fusion    0.815
Proposed: Fine-tuned on the SLVM   RGB+Skeleton          0.817
                                   Depth+Skeleton        0.829
                                   Multi-modal Fusion    0.836

The study of [28] was based on a 3D-ConvNet that learned spatio-temporal features from RGB-D streams. Compared with this 3D-CNN framework, our model is deeper and iteratively integrates discriminative data representations from multi-modal data. The study of [27] uses a DBN and a 3D-CNN to learn contextual frame-level representations. We also note that it is the only method that incorporates more structured temporal modeling; this is an excellent framework because an HMM-like approach is suitable for fusing temporal information of variable length. The study of [25] fused multiple modalities at several spatial and temporal scales to make the deep architecture robust to missing signals in one or several channels, and this work achieves the best mean Jaccard Index score in the competition. It is worth noting that deep neural architectures combining multi-scale features usually achieve better performance.

6. CONCLUSION AND DISCUSSION
Dynamic sign language recognition involves challenges such as temporal variance, spatial complexity and movement epenthesis. Therefore, extracting spatio-temporal features with 3D CNNs becomes the key to effective dynamic sign language recognition. In this paper, we have presented a 3D-CNN model for continuous dynamic sign language classification on multi-modal data consisting of infrared, contour and skeleton data. Without any prior knowledge, our 3D-CNN model can automatically learn spatio-temporal motion information and use it to recognize entire dynamic signs. Since our recorded dataset comprises three modalities, our model integrates distinct strategies for the different data: (1) Skeleton data, used to track the trajectory of the upper limbs, further improve the accuracy of sign language recognition. (2) In contrast to previous mainstream methods that use three-channel RGB image data, we use single-channel infrared data to improve computational efficiency. (3) Synchronous contour data are used to compensate for classification errors from the infrared data. On top of that, we used a late fusion strategy to combine the two classification results from the sub-networks. To involve more data in training, fine-tuning the architectures from a pre-trained model is also used as another practical technique to prevent over-fitting. We evaluated our model on a new dataset of dynamic sign language and against other benchmarks; the experiments show that our proposed method achieves competitive results. There are several directions for future work.
For the task of capturing temporal information in video sequences, a simple temporal convolution strategy is not sufficient for dynamic sign language recognition. We will investigate the possibility of building a unified model that makes better use of the temporal component of the problem, in which the temporal convolutions are directly connected to a long short-term memory (LSTM) sequence classifier. Furthermore, the SLVM dataset contains 6800 samples in total; to further prevent over-fitting and improve the generalization performance of the classifier, performing offline and online spatio-temporal data augmentation will be another part of our future work.

Funding
This work was supported by the National Key Technology Research Program of the Ministry of Science and Technology of China [grant number: 2015BAK33B02]; the National Natural Science Foundation of China [grant number: 61671483]; and the Continuing Education Research Foundation of Southwest University of Science and Technology [grant number: 17JYF01].

Appendix A
$a$                     A scalar (integer or real)
$\mathbf{a}$            A vector
$\mathbf{A}$            A matrix
$\sum_{i=1}^{n} a_i$    The sum from $i=1$ to $n$ of $a_i$
$\rho$                  Dropout probability
$\lambda$               Regularization parameter
$\varepsilon$           Learning rate

References
1 Molchanov, P., Gupta, S., Kim, K. and Kautz, J. (2015) Hand Gesture Recognition with 3d Convolutional Neural Networks. Proc. IEEE Conf. Computer Vision and Pattern Recognition Workshops, Boston, USA, June 7–12, pp. 1–7. IEEE, New Jersey, USA.
2 Willems, G., Tuytelaars, T. and Gool, L.V. (2008) An Efficient Dense and Scale-Invariant Spatio-temporal Interest Point Detector. Computer Vision – ECCV 2008, Marseille, France, October 12–18, 2008, pp. 650–663. Springer, Berlin, Heidelberg.
3 Dollár, P., Rabaud, V., Cottrell, G. and Belongie, S. (2005) Behavior Recognition via Sparse Spatio-temporal Features. Visual Surveillance and Performance Evaluation of Tracking and Surveillance, Beijing, China, October 15–16, 2005, pp. 65–72. IEEE, New Jersey, USA.
4 Wang, S.B., Quattoni, A., Morency, L.P., Demirdjian, D. and Darrell, T. (2006) Hidden Conditional Random Fields for Gesture Recognition. IEEE Comput. Soc. Conf. Computer Vision and Pattern Recognition, New York, USA, June 17–22, 2006, pp. 1521–1527. IEEE, New Jersey, USA.
5 Starner, T., Weaver, J. and Pentland, A. (1998) Real-time American sign language recognition using desk and wearable computer based video. IEEE Trans. Pattern Anal. Mach. Intell., 20, 1371–1375.
6 Dardas, N.H. and Georganas, N.D. (2011) Real-time hand gesture detection and recognition using bag-of-features and support vector machine techniques. IEEE Trans. Instrum. Meas., 60, 3592–3607.
7 Krizhevsky, A., Sutskever, I. and Hinton, G.E. (2012) Imagenet Classification with Deep Convolutional Neural Networks. Int. Conf. Neural Information Processing Systems, Doha, Qatar, November 12–15, 2012, pp. 1097–1105. Springer, Berlin, Heidelberg.
8 Ji, S., Xu, W., Yang, M. and Yu, K. (2013) 3d convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell., 35, 221–231.
9 Lecun, Y., Huang, F.J. and Bottou, L. (2004) Learning Methods for Generic Object Recognition with Invariance to Pose and Lighting. Computer Vision and Pattern Recognition, 2004. CVPR 2004. Washington, DC, USA, June 27–July 2, 2004, pp. II-97–104 Vol. 2. IEEE, New Jersey, USA.
10 Ning, F., Delhomme, D., Lecun, Y., Piano, F., Bottou, L. and Barbano, P.E. (2005) Toward automatic phenotyping of developing embryos from videos. IEEE Trans. Image Process., 14, 1360–1371.
11 Simonyan, K. and Zisserman, A. (2014) Two-stream convolutional networks for action recognition in videos. Adv. Neural Inf. Process. Syst., 1, 568–576.
12 Ohn-Bar, E. and Trivedi, M.M. (2014) Hand gesture recognition in real time for automotive interfaces: a multimodal vision-based approach and evaluations. IEEE Trans. Intell. Transport. Syst., 15, 2368–2377.
13 Escalera, S.A. (2014) Chalearn Looking at People Challenge 2014: Dataset and Results. Workshop Eur. Conf. Computer Vision. Zurich, Switzerland, September 6–12, 2014, pp. 459–473. Springer, Berlin, Heidelberg.
14 Shu, Z., Yun, K. and Samaras, D. (2014) Action Detection with Improved Dense Trajectories and Sliding Window. Eur. Conf. Computer Vision, Zurich, Switzerland, September 6–12, 2014, pp. 541–551. Springer, Berlin, Heidelberg.
15 Ming, J.C., Omar, Z. and Jaward, M.H. (2017) A review of hand gesture and sign language recognition techniques. Int. J. Mach. Learn. Cybern., 1, 1–23.
16 Zaki, M.M. and Shaheen, S.I. (2011) Sign language recognition using a combination of new vision based features. Pattern Recognit. Lett., 32, 572–577.
17 Huang, C.L., Huang, W.Y. and Lien, C.C. (1995) Sign Language Recognition Using 3-d Hopfield Neural Network. Int. Conf. Image Processing, 1995. Proceedings, Washington, USA, October 23–26, 1995, pp. 611–614, Vol. 2. IEEE, New Jersey, USA.
18 Murakami, K. and Taguchi, H. (1991) Gesture Recognition Using Recurrent Neural Networks. Conf. Human Factors in Computing Systems, New Orleans, LA, USA, April 27–May 2, 1991, pp. 237–242. ACM, New York, USA.
19 Jong-Sung Kim, W.J. and Bien, Z. (1996) A dynamic gesture recognition system for the Korean sign language (ksl). IEEE Trans. Syst. Man Cybern. B, 26, 354–359.
20 Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R. and Fei-Fei, L. (2014) Large-Scale Video Classification with Convolutional Neural Networks. IEEE Conf. Computer Vision and Pattern Recognition, Columbus, OH, USA, June 24–27, 2014, pp. 1725–1732. IEEE, New Jersey, USA.
21 Tran, D., Bourdev, L., Fergus, R., Torresani, L. and Paluri, M. (2014) Learning spatiotemporal features with 3d convolutional networks. pp. 4489–4497.
22 Zhu, G. and Zhang, L. (2017) Large-Scale Isolated Gesture Recognition Using Pyramidal 3d Convolutional Networks. Int. Conf. Pattern Recognition, Dhaka, Bangladesh, February 13–14, 2017, pp. 19–24. IEEE, New Jersey, USA.
23 Li, Y. and Miao, Q. (2017) Large-Scale Gesture Recognition with a Fusion of rgb-d Data based on the c3d Model. Int. Conf. Pattern Recognition, Dhaka, Bangladesh, February 13–14, 2017, pp. 25–30. IEEE, New Jersey, USA.
24 Duan, J., Zhou, S., Wan, J., Guo, X. and Li, S.Z. (2016) Multi-modality fusion based on consensus-voting and 3d convolution for isolated gesture recognition. pp. 44–49.
25 Neverova, N., Wolf, C., Taylor, G. and Nebout, F. (2014) Moddrop: adaptive multi-modal gesture recognition. IEEE Trans. Pattern Anal. Mach. Intell., 38, 1692–1706.
26 Wu, D. and Shao, L. (2014) Deep Dynamic Neural Networks for Gesture Segmentation and Recognition. Computer Vision – ECCV 2014 Workshops, Zurich, Switzerland, September 6–7 and 12, 2014, pp. 552–571. Springer International Publishing.
27 Wu, D. and Pigou, L. (2016) Deep dynamic neural networks for multimodal gesture segmentation and recognition. IEEE Trans. Pattern Anal. Mach. Intell., 38, 1583.
28 Pigou, L., Dieleman, S., Kindermans, P.J. and Schrauwen, B. (2014) Sign Language Recognition Using Convolutional Neural Networks. Computer Vision – ECCV 2014 Workshops, Zurich, Switzerland, September 6–7 and 12, 2014, pp. 572–578. Springer International Publishing.
29 Jarrett, K., Kavukcuoglu, K., Ranzato, M. and Lecun, Y. (2010) What is the Best Multi-stage Architecture for Object Recognition? IEEE Int. Conf. Computer Vision, Kyoto, Japan, September 27–October 4, 2009, pp. 2146–2153. IEEE, New Jersey, USA.
30 Sutskever, I., Martens, J., Dahl, G. and Hinton, G. (2013) On the Importance of Initialization and Momentum in Deep Learning. Int. Conf. Machine Learning, Atlanta, USA, June 16–21, 2013, pp. III-1139. ACM, New York, USA.
31 Zhu, G., Zhang, L., Shen, P. and Song, J. (2017) Multimodal gesture recognition using 3d convolution and convolutional lstm. pp. 4517–4523.
32 Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I. and Salakhutdinov, R.R. (2012) Improving neural networks by preventing co-adaptation of feature detectors. Comput. Sci., 3, 212–222.
33 Molchanov, P., Gupta, S., Kim, K. and Pulli, K. (2015) Multi-sensor System for Driver’s Hand-Gesture Recognition. IEEE Int. Conf. Workshops on Automatic Face and Gesture Recognition, Ljubljana, Slovenia, May 4–8, 2015, pp. 1–8. IEEE, New Jersey, USA.
34 Bastien, F. et al. (2012) Theano: new features and speed improvements. Comput. Sci., 11, 1–10.
35 Bastien, F. et al. (2015) Blocks and fuel: frameworks for deep learning. Comput. Sci., 6, 1–5.
36 Rivasperea, P., Cotaruiz, J., Venzor, J.A.P., Chaparro, D.G. and Rosiles, J.G. (2013) Lp-svr model selection using an inexact globalized quasi-Newton strategy. J. Intell. Learn. Syst. Appl., 5, 19–28.
37 Cihan, N., Kindiroglu, A.A. and Akarun, L. (2014) Gesture Recognition Using Template Based Random Forest Classifiers. Computer Vision – ECCV 2014 Workshops, Zurich, Switzerland, September 6–7 and 12, 2014, pp. 579–594. Springer International Publishing.
38 Chang, J.Y. (2014) Nonparametric Gesture Labeling from Multi-modal Data. Computer Vision – ECCV 2014 Workshops, Zurich, Switzerland, September 6–7 and 12, 2014, pp. 503–517. Springer International Publishing.
39 Monnier, C., German, S. and Ost, A. (2014) A Multi-scale Boosted Detector for Efficient and Robust Gesture Recognition. Computer Vision – ECCV 2014 Workshops, Zurich, Switzerland, September 6–7 and 12, 2014, pp. 491–502. Springer International Publishing.

Footnotes
1 The dataset can be downloaded at: https://pan.baidu.com/s/1pL2qwuZ
2 Available at http://pan.baidu.com/s/1dEX29R7

Author notes
Handling editor: Yannis Manolopoulos
© The British Computer Society 2018. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com
This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices)
The Computer Journal, Oxford University Press

3D Convolutional Neural Networks for Dynamic Sign Language Recognition

Publisher
Oxford University Press
Copyright
© The British Computer Society 2018. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com
ISSN
0010-4620
eISSN
1460-2067
D.O.I.
10.1093/comjnl/bxy049

Abstract

Automatic dynamic sign language recognition is even more challenging than gesture recognition because the vocabularies are large and signs are context dependent. Previous works in this direction tend to build classifiers based on complex hand-crafted features computed from the raw inputs. As a type of deep learning model, convolutional neural networks (CNNs) have significantly advanced the accuracy of human gesture classification. However, such methods currently tend to treat video frames as 2D images and recognize gestures at the individual frame level. In this paper, we present a data-driven system in which 3D-CNNs are applied to extract spatial and temporal features from video streams, and the motion information is captured by noting the variation in depth between each pair of consecutive frames. To further boost the performance, multiple modalities of video streams, including infrared, contour and skeleton, are used as input to the architecture, and the prediction results estimated by the different sub-networks are fused together. To validate our method, we introduce a new challenging multi-modal dynamic sign language dataset captured with Kinect sensors. We evaluate the proposed approach on the collected dataset and achieve superior performance. Moreover, our method achieves a mean Jaccard Index score of 0.836 on the ChaLearn Looking at People Gesture datasets.

1. INTRODUCTION
As an important bridge across the communication gap between deaf and hearing people, gesture recognition has drawn increasing attention from researchers due to its growing potential in areas such as human–computer interaction. However, there are still numerous challenging problems, especially in video-based applications for dynamic sign language recognition, because it is difficult to capture the information from hand-shape and spatial–temporal trajectories of the upper limb. The first problem is to detect and track the moving limbs.
Due to environmental factors such as complex backgrounds and different lighting conditions, gesture segmentation has always been a difficult task. In addition, occlusions may occur during a gesture, and because human limbs deform, they can produce a wide variety of complex sign language vocabularies. Second, it is a nontrivial challenge to integrate the temporal and spatial features. Human hands have a great deal of flexibility in the spatial dimension, so instances of the same sign always differ from each other because performers differ in their habits of expression: trajectory, amplitude, direction and position. Even if a gesture is repeated several times by the same person, the speed and the amplitude of each movement will not be exactly equal. Moreover, observing the same gesture from different viewpoints also results in different appearances. Finally, sign language recognition systems based on computer vision involve a large amount of data during video preprocessing, and previous approaches tend to use multi-dimensional parameters to extract features from gestures to ensure better performance. This presents the challenge of detecting and classifying gestures immediately upon or before their completion to provide rapid feedback [1].

Hand-crafted features have dominated object recognition and image classification tasks for many years. Previous studies of gesture recognition focused mainly on adapting hand-crafted features. These approaches typically include two stages: first, building a detector to obtain spatial–temporal features from raw video frames, and second, training classifiers on the obtained features. Different feature detection methods such as Hessian3D [2] and Cuboids [3] have been widely used.
For classifiers, popular methods are conditional random fields [4], hidden Markov models (HMM) [5] and support vector machines (SVM) [6]. It is well known that features are important for the task, since the choice of features is highly problem-dependent. For dynamic sign language recognition in particular, different gesture vocabularies may differ dramatically in their appearance and motion patterns. However, feature learning is not part of the classifiers discussed above and must be performed separately to extract features such as edges, gradients, pixel intensities and object shapes.

In recent years, deep learning has had a tremendous impact on pattern recognition, computer vision and other fields. Deep neural networks can extract a hierarchy of features by building high-level features from low-level ones. Such architectures can be trained using either supervised or unsupervised learning, and they reach performance levels that were previously unattainable on tasks such as image classification [7] and human action recognition [8]. Convolutional neural networks (CNNs) are a type of supervised learning model in which interleaved trainable filters and local neighborhood pooling operators act on the raw input images, resulting in a hierarchy of progressively more complex features. It has been shown that CNNs can be invariant to certain variations such as surrounding clutter, lighting and pose [9]. As a class of effective deep neural network architectures for automated feature construction, CNNs have been applied primarily to 2D images, and such work does not take into account the motion information encoded in multiple contiguous frames [10].
To incorporate the motion information effectively into video analysis, we consider the use of 3D-CNNs for dynamic sign language recognition and perform 3D convolutions in the convolutional layers, so that discriminative features encoded in multiple adjacent frames are captured along both the spatial and temporal dimensions. In fact, applying these models to problems that involve understanding the content of continuous frames is still at an early stage. Some recent works [11] have explored applications of this idea. The challenges are the lack of sufficiently large datasets and the high cost of labeling data in many real application scenarios, as well as the increased computational complexity of modeling the mutual connections among continuous video frames.

Due to the widespread usage of the Microsoft Kinect, there has been a surge of related methods for human action and gesture recognition. Several new public datasets [12, 13] have made it more convenient for researchers to design novel algorithms that rely on multi-modal data. However, most of the existing public datasets lack effective and accurate labeling or are stored in a single data format. Given the limitations of publicly available datasets for multi-modal dynamic gesture recognition, we present a novel labeled upper-limb dataset of 20 classes to train a deep neural network for the dynamic sign language recognition task. Each class was intended for human–computer interfaces in the real-world environment of a museum and was recorded by Kinect sensors. We also report the classification results of our approach on the publicly available ChaLearn 2014 [14] data to evaluate gesture recognition performance, and our model achieves competitive performance.
The major contributions of this paper can be summarized as follows:

We collected a novel Sign Language Video in Museums dataset, which we refer to as SLVM. The SLVM contains approximately 7K manually labeled sign performances in continuous video sequences, which belong to 20 categories of sign language commonly used in the real-world environment of museums in China.

A novel 3D CNN for dynamic sign language recognition is proposed. This architecture extracts both spatial and temporal features from raw video data automatically, without any prior knowledge, and thus captures motion information from dynamic gestures. Compared with previous work that trained a 3D CNN for gesture recognition, our model is deeper and iteratively integrates discriminative data representations from multi-modal data. Furthermore, fine-tuning the architecture from a pre-trained model is adopted as an optional technique to prevent over-fitting.

An effective fusion strategy for sub-networks with different input modalities is proposed. It improves the performance of the architecture by compensating for the errors of the separate classifiers, and thus ensures the robustness of the network to missing signals.

The remainder of this work is organized as follows. Section 2 introduces related work on human action recognition and recent work on CNNs applied to gesture recognition. Section 3 describes the multi-modal data collection and the data preprocessing. In Section 4, the details of our network model are presented. Finally, Section 5 presents the experimental analysis and our final remarks.

2. RELATED WORKS

Gesture recognition has been studied for decades. Traditional approaches to vision-based gesture recognition from video typically include sparse or dense extraction of spatial or spatio-temporal engineered descriptors followed by classification [15]. This work addressed more general aspects of hand-crafted feature-based methods and deep learning-based methods.

2.1.
Hand-crafted feature-based methods

In the early stage, many hand-crafted features for effective video analysis were introduced in the area of sign language and action recognition. Typically, histograms of oriented gradients or optical flow are computed, and hidden Markov models, finite-state machines and dynamic time warping are then commonly applied as classifiers. Starner et al. [5] introduced an HMM-based system that uses color cameras as sensors to track unadorned upper limbs in real time and interpret American Sign Language with a vocabulary of 40 words. Zaki et al. [16] presented another American Sign Language recognition system with a lexicon of 30 words. They constructed appearance-based representations and an upper limb tracking system that uses an HMM as the classifier and achieves an accuracy of 89.09% on the RWTH-BOSTON-50 dataset, which includes 50 isolated signs. Ohn-Bar et al. [12] evaluated various hand-crafted features and classifiers for in-car dynamic gesture recognition with RGB-D data. They obtained state-of-the-art performance with a combination of histogram of gradient features and an SVM classifier.

Artificial neural networks are another widely used class of classifier for human action recognition tasks. A sign language recognition system based on 3D Hopfield neural networks is presented in [17]; it covers a vocabulary of 15 different hand gestures and achieves a 91% recognition rate. Kouichi et al. [18] proposed a neural network to recognize a finger alphabet of 42 symbols. Kim et al. [19] took the 3D coordinates and angles of the hands as input and used a fuzzy min–max neural network for hand gesture recognition. This method achieved an accuracy of 85% on 25 isolated gestures. However, hand-crafted feature-based methods cannot take all factors into consideration at the same time, and most of these approaches either employ pre-segmented video sequences or treat detection and classification as separate problems.

2.2.
Deep learning-based methods

In contrast to the traditional approaches that rely on the construction of hand-crafted features, there is a growing trend toward feature representations learned by deep neural networks. Karpathy et al. [20] proposed a CNN-based model that classifies large-scale video while operating at two spatial resolutions. Zisserman et al. [11] increase the amount of training data through multi-modal learning and extract spatial and temporal features concurrently with two-stream CNNs. The key of these approaches is to learn spatial and temporal features consecutively.

3D convolution has been applied to video stream analysis and classification in recent years. Tran et al. [21] proposed a 3D-CNN that analyzes a series of short video clips and averages the network's responses over all clips. Molchanov et al. [1] applied a 3D-CNN to the whole video sequence and introduced space–time video augmentation techniques to avoid over-fitting, which motivated our work. Their 3D-CNN is based on spatio-temporal features extracted from video data. It achieved significant performance improvements over hand-crafted feature baselines and also obtained promising accuracy on large-scale datasets; many recent approaches, such as [22, 23], build on it. Moreover, two-stream frameworks have obtained remarkable performance for human action recognition. Duan et al. [24] combined a convolutional two-stream consensus voting network and a 3D ConvNet for dynamic gesture recognition. Their method obtained state-of-the-art performance on the RGB-D HuDaAct datasets.

Another obvious approach is multi-modal fusion. Neverova et al. [25] proposed to combine color images of the hand regions with a pose descriptor consisting of seven subsets of upper-body skeleton features to recognize gestures. Wudi et al.
[26] presented a system for multi-modal gestures that consists of two deep neural networks: a CNN to manage and fuse batches of depth and RGB data, and a Gaussian–Bernoulli Deep Belief Network (DBN) to handle skeletal dynamics. In the follow-up work [27], they demonstrated that multi-modal fusion of the different inputs yields a clear improvement over unimodal approaches, owing to the complementary nature of the input modalities.

To take full advantage of CNNs, we propose a multi-scale 3D-CNN that integrates multi-modal visual data for sign language detection and recognition. Each type of data stream provides several adjacent frames as input. Our main innovation is a new training strategy that carefully initializes the sub-components of the individual modalities and then fine-tunes the model from the pre-trained model to prevent over-fitting. Moreover, we fuse the individual results to improve the performance of the proposed model. In the experiments, we show that the proposed deep network makes a clear performance improvement over unimodal approaches.

3. DATA COLLECTION AND PREPROCESSING

3.1. Data collection with Kinect

Table 1. Evaluation of different approaches on the SLVM dataset.

Training model          Accuracy (%)   Improvement (%)
Contour
  Tanh units            79.1
  +ReLUs                83.5           5.43
  +LCN                  85.4           2.39
  +Dropout              87.6           2.57
Infrared
  Tanh units            79.9
  +ReLUs                84.3           5.50
  +LCN                  86.2           2.25
  +Dropout              88.3           2.43
Multi-modal Fusion      89.2           1.01

In this section, we describe how we used Kinect sensors to record the video samples of the dynamic sign language dataset. We developed a recording system, which we call ‘Sign Language Recorder’, based on the Microsoft Kinect for Windows SDK V2 to capture multi-modal data. For each gesture, infrared and contour image sequences were captured; the source video was recorded at 30 FPS and stored at a resolution of 512*424 in MP4 file format. In addition, skeletal data also contribute to the network’s prediction by tracking the trajectory of the hands, which is crucial for discriminating between sign language classes performed in similar body poses. The multi-modal data include 6800 samples, which constitute a complete vocabulary of 20 Chinese sign language categories performed by 17 different users; each gesture in the dataset is accompanied by a ground-truth label as well as information about its start and end frames. The storage structure of the multi-modal data is described in Fig. 1. Taking contour data as an example: the camera views the physical space in front of the Kinect and generates a contour image of the human in the plane facing the sensor; this contour image is then used for background removal, followed by generation of the depth profile of the human body.
To improve computational efficiency and ensure real-time communication between human and computer, we abandoned the traditional approach [20] of using higher-dimensional data and instead used infrared and contour data, each consisting of a single channel, to train the respective deep neural networks. In addition, skeleton data are used to track the trajectory of the hands precisely.

Figure 1. Overview of our method on an example from the 2017 Sign Language Video in Museums (SLVM) dataset.

3.2. Sign language temporal segmentation

Each sign language vocabulary item in the SLVM dataset has a different duration. To normalize the temporal lengths, we first re-sampled each item to 32 frames, using a sliding-window gesture detector, by dropping or repeating frames. Dynamic sign language generally consists of three temporally overlapping phases (P1, P2, P3): preparation, nucleus and retraction [22]. The preparation and retraction phases, regarded as the beginning and ending actions, can be quite similar for different gestures and are hence deemed less useful or even detrimental to accuracy. As illustrated in Fig. 2, for the sign ‘OK’, the gesture sequence can essentially be divided into P1: lifting the arm to the specified level; P2: the motions of the OK gesture; and P3: slowly lowering the arm.

Figure 2. Sliding windows for gesture temporal segmentation.

More precisely, since our goal is to capture the variation in the demonstration process of sign language, we train a binary classifier that relies principally on the nucleus phase to distinguish between stationary time and active period.
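The length normalization described above (re-sampling every sign to 32 frames by dropping or repeating frames) can be illustrated with a short sketch. It assumes that frames are picked at uniformly spaced positions, which the text does not state explicitly, and the function names `resample_indices` and `resample_clip` are hypothetical.

```python
def resample_indices(num_frames, target=32):
    """Return `target` source-frame indices spread uniformly over a clip.

    Clips longer than `target` drop frames; shorter clips repeat frames.
    Uniform spacing is an assumption, not a rule stated in the paper.
    """
    if num_frames <= 0:
        raise ValueError("clip must contain at least one frame")
    # Output position k maps to the source frame at the same relative position.
    return [(k * num_frames) // target for k in range(target)]


def resample_clip(frames, target=32):
    """Build a clip of exactly `target` frames from a list of frames."""
    return [frames[i] for i in resample_indices(len(frames), target)]


# A 64-frame clip keeps every second frame; a 16-frame clip repeats each frame twice.
long_clip = resample_clip(list(range(64)))
short_clip = resample_clip(list(range(16)))
```

Both resulting clips contain 32 frames, so they can be stacked into input volumes of identical size.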
The classifier is a two-layer fully connected neural network that takes the articulated pose descriptor as input. All training frames that carry a sign language word label are used as positive examples, while frames that lie before and after such labeled segments are treated as negatives. Each frame is thus annotated with the label ‘static’ or ‘dynamic’. Using this classifier, we can predict the beginning and ending frames of every sign in the video samples as described in Algorithm 1.

Algorithm 1. The approach for sign language temporal segmentation

Input:
  Gxs: the original starting frame number of gesture x
  Gxe: the original ending frame number of gesture x
  Lx = Gxe − Gxs: the length of gesture x

1.  for x = x1, …, x20 do
2.    if Lx > 32 then
3.      Gxns = Gxs + (Lx − 32)/2, where Gxns is the new starting frame number of gesture x
4.      Gxne = Gxns + 32, where Gxne is the new ending frame number of gesture x
5.    else (Lx ≤ 32)
6.      Gxns = Gxs, so the starting frame number of gesture x is unchanged
7.      Gxne = Gxs + 32, where Gxne is the new ending frame number of gesture x

Output: the new starting and ending frame numbers of each gesture

3.3. Preprocessing

Training directly on the raw videos captured by Kinect, which are sequences of 512*424-pixel images, would be computationally very demanding. Therefore, before feeding the data to the deep neural network, we preprocess all videos by first cropping the upper body and the highest limb using the skeleton information captured in the dataset. Background removal using median filtering and noise reduction with a threshold method [28] were then applied to the cropped data. Contour frames are normalized to zero mean, and infrared frames are normalized only by the image standard deviation. As shown in Fig. 3, the final inputs to the sign language classifier contain four video samples: upper body with infrared frames, upper limb with infrared frames, upper body with contour frames, and upper limb with contour frames. All of the frame sources are of size 32*64*64 (32 frames in the temporal dimension, with a spatial resolution of 64*64 pixels).

Figure 3. Overview of our 3D-CNN classifier for sign language recognition.

4. MODELS

In this section, we present our 3D convolutional neural network architecture for the multi-modal dynamic sign language detection and recognition task. Dropout and regularization techniques are used to reduce over-fitting.

4.1. 3D convolution

Convolutional neural networks are biologically inspired variants of the multi-layer perceptron. The idea is that a filter window, called the local receptive field, moves over the units of the previous layer. In 2D CNNs, each unit in the convolutional layer receives inputs from the set of units located in the filter window and is calculated by the following equation:

$$\Upsilon_{xy} = f\left(b + \sum_{i=0}^{n}\sum_{j=0}^{n} w_{ij}\, v_{(x+i)(y+j)}\right) \tag{1}$$

where $f(\cdot)$ is the neural activation function such as Sigmoid, Tanh or ReLU, $\Upsilon_{xy}$ is a unit in the feature map at position $(x,y)$, $v_{(x+i)(y+j)}$ is the input unit at position $(x+i,y+j)$, $w_{ij}$ is an $(n\times n)$ array of shared weights and $b$ is the shared bias of the feature map. In a 2D CNN architecture, convolutions and pooling are applied to 2D feature maps and compute features from the spatial dimensions only. When applied to sign language recognition based on video analysis, it is also desirable to capture the motion information encoded in multiple consecutive frames along with the spatial features. For this reason, we propose to perform 3D operations in the convolution and pooling stages, so that discriminative features from both the spatial and temporal dimensions are captured. The 3D convolution is achieved by convolving a 3D kernel with the cube formed by stacking multiple contiguous frames together. With this structure, feature maps in the convolution layer are connected to multiple successive frames in the previous layer, thereby capturing motion information.
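To make the 3D operation described above concrete, the following sketch slides a single 3D kernel over a small stack of frames. It is only an illustration under stated assumptions: plain nested lists instead of a tensor library, stride 1 with no padding, a ReLU activation, and a `[frame][row][col]` index order chosen for this example. The names `conv3d_valid` and `relu` are hypothetical and not part of the described system.

```python
def relu(z):
    """Rectified linear activation."""
    return z if z > 0.0 else 0.0


def conv3d_valid(volume, kernel, bias=0.0, activation=relu):
    """Slide one 3D kernel over a stack of frames (stride 1, no padding).

    volume: nested list indexed [frame][row][col]
    kernel: nested list indexed [k][i][j] (temporal, vertical, horizontal)
    Returns one feature map indexed [frame][row][col].
    """
    T, H, W = len(volume), len(volume[0]), len(volume[0][0])
    m, n_h, n_w = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - m + 1):
        plane = []
        for y in range(H - n_h + 1):
            row = []
            for x in range(W - n_w + 1):
                # Weighted sum over the kernel's spatio-temporal neighbourhood.
                total = bias
                for k in range(m):
                    for i in range(n_h):
                        for j in range(n_w):
                            total += kernel[k][i][j] * volume[t + k][y + i][x + j]
                row.append(activation(total))
            plane.append(row)
        out.append(plane)
    return out


# Toy clip: 4 frames of 3x3 pixels whose intensity equals the frame index.
clip = [[[float(t)] * 3 for _ in range(3)] for t in range(4)]
# A 2x1x1 kernel that subtracts each frame from the next one (temporal difference).
temporal_diff = [[[-1.0]], [[1.0]]]
motion = conv3d_valid(clip, temporal_diff)
```

Because the kernel spans two frames, `motion` has three frames, and every unit equals 1.0, the constant brightness change between consecutive frames. A purely spatial 2D kernel applied frame by frame could not produce this response.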
Similar to the 2D convolution operator, 3D convolution is calculated by the following equation:

$$\Upsilon_{xyt} = f\left(b + \sum_{i=0}^{n}\sum_{j=0}^{n}\sum_{k=0}^{m} w_{ijk}\, v_{(x+i)(y+j)(t+k)}\right) \tag{2}$$

where $w_{ijk}$ is the $(i,j,k)$th value of the kernel connected to the feature map in the previous layer and $m$ is the size of the 3D kernel along the temporal dimension. The motion information in multiple contiguous frames is captured by this setting. Since the weights of a kernel are replicated throughout the entire cube, a single 3D convolution kernel can extract only one type of feature from the frame cube. A fundamental principle of CNN architecture design is that the number of feature maps should increase in later layers, so that multiple types of features can be generated from the same set of lower-level feature maps. As in the 2D case, this is achieved by applying multiple 3D convolutions with distinct kernels to the same location in the previous layer.

4.2. Our 3D-CNN architecture for sign language recognition

Inspired by the framework successfully applied to human action recognition [8], CNN architectures suitable for the sign language recognition task can be devised based on the design principle of 3D convolution described above. Our proposed model is a data-driven learning system that consists of two sub-networks: an Infrared-classifier network (I-CN) with parameters $\Theta_I$ and a Contour-classifier network (C-CN) with parameters $\Theta_C$. On top of that, dimensionality reduction and complicated preprocessing are naturally embedded in the framework. We use Microsoft Kinect to collect the SLVM dataset. The dataset includes an infrared stream, a contour stream and body skeleton data, which were used simultaneously to track the trajectory of the limbs. Unlike traditional approaches that use color images with three RGB channels, the infrared and contour images contain only a single channel.

The complete architecture is depicted in Fig. 3. For each type of visual source, we consider 32 frames of size 64*64 centered on the current frame as input to the 3D-CNN. This results in four feature maps, denoted infrared-Body, contour-Body, infrared-Hands and contour-Hands. Each feature map contains 32 stacked frames from the corresponding channel as a cube in the temporal dimension. Multiple source inputs usually lead to better performance than a single-format data source. Our deep neural network consists of 11 layers including the input layer. After the input layer, the first convolution layer consists of 16 feature maps produced by 5*5*5 spatial and temporal 3D convolutional kernels, followed by local contrast normalization (LCN) [29]. A 3D max sub-sampling with strides (2,2,2) is then applied. The second convolution layer uses 32 feature maps with 5*5*5 3D kernels, followed by 3D sub-sampling with strides (2,2,2). The third convolution layer is composed of 48 feature maps with 4*4*4 3D kernels, followed by 3D max sub-sampling with strides (2,2,2). These were further followed by three fully connected softmax classifiers with 512 neurons in their hidden layer, which produced class-membership probabilities $P(C \mid x_I, \Theta_I)$ and $P(C \mid x_C, \Theta_C)$ for the 20 sign language classes. Rectified linear unit (ReLU) activation functions were used throughout the architecture, except for the softmax layers, to speed up training. We computed the output of the softmax layers as

$$P(C \mid x, \Theta) = \mathrm{softmax}(x_k) = \frac{\exp(x_k)}{\sum_{i=1}^{k}\exp(x_i)} \tag{3}$$

where $x_i$ was the output of neuron $i$.

Algorithm 2. CNN-based classifier for sign language recognition

Input:
  D1: I1 = {I1j}, j ∈ [1…t]: raw input of the infrared sequence in the form 32*64*64 (32 frames in the temporal dimension, 64*64 spatial resolution).
  D2: C1 = {C1j}, j ∈ [1…t]: raw input of the contour sequence in the form 32*64*64 (32 frames in the temporal dimension, 64*64 spatial resolution).
  D3: S1 = {S1j}, j ∈ [1…t]: raw input of the skeletal feature sequence.
  Label: Y = {Yj}, j ∈ [1…t]: frame-based local label.

1.  for n = 0 to 1 do
2.    if n = 0 then
3.      Pre-train the sub-network for infrared using data D1, and fine-tune the 3D-CNN under supervision using label Y by standard mini-batch stochastic gradient descent back-propagation.
4.      Produce class-membership probabilities P(C∣xI,ΘI) from Equation (3).
5.    else
6.      Pre-train the sub-network for contour using data D2, and fine-tune the 3D-CNN under supervision using label Y by standard mini-batch stochastic gradient descent back-propagation.
7.      Produce class-membership probabilities P(C∣xC,ΘC) from Equation (3).
8.  Fuse the two networks by feeding their respective class-membership probabilities to Equation (7).

Output: the fused class-membership probabilities

4.3. Initialization

We initialized the weights of the fully connected hidden layers and the softmax layer with random samples from a normal distribution with $\mu=0$ and $\sigma=0.03$. To obtain a non-zero partial derivative during training, the biases of all layers except the softmax layer have a fixed initial value of 1. The biases in the softmax layer were set to 0. The weights of the 3D convolution layers are randomly initialized from a uniform distribution over $[-W_b, W_b]$, where $W_b=\sqrt{6/(n_i+n_o)}$, and $n_i$, $n_o$ represent the number of neurons in the input and output layers, respectively.

4.4. The process of optimization

Training a CNN architecture involves optimizing the network parameters $\Theta$ to minimize a cost function over the dataset $D$, with fine-tuning by the back-propagation algorithm [9]. We selected the negative log-likelihood as the cost function:

$$L(\Theta, D) = -\frac{1}{|D|}\sum_{i=0}^{|D|} \log\left(P\left(C^{(i)} \mid x^{(i)}, \Theta\right)\right) \tag{4}$$

We performed the optimization using stochastic gradient descent with mini-batches of size 16 for both the I-CN and the C-CN.
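As an illustration of how the softmax output of Equation (3) feeds the negative log-likelihood cost of Equation (4), the following sketch evaluates both on a toy mini-batch. It uses only the Python standard library; the function names and the max-subtraction trick for numerical stability are choices made for this example, not details taken from the described implementation.

```python
import math


def softmax(scores):
    """Turn raw output-layer activations into class-membership probabilities."""
    peak = max(scores)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def negative_log_likelihood(batch_scores, batch_labels):
    """Mean negative log-probability assigned to the correct class of each sample."""
    losses = []
    for scores, label in zip(batch_scores, batch_labels):
        probs = softmax(scores)
        losses.append(-math.log(probs[label]))
    return sum(losses) / len(losses)


# Two samples over three classes; the integer labels index the correct class.
scores = [[2.0, 0.5, 0.1], [0.2, 0.2, 3.0]]
labels = [0, 2]
cost = negative_log_likelihood(scores, labels)
```

The cost approaches zero as the network assigns probability close to 1 to the correct class of every sample, and it grows without bound when a correct class receives probability close to 0.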
Nesterov’s accelerated gradient descent (NAG) [30] with a fixed momentum coefficient of 0.9 is also used to update the network’s parameters $\theta\in\Theta$ at every iteration $t$ as:

$$\begin{aligned}
\nabla f(\theta_t) &= \left\langle \frac{\delta L}{\delta \theta_{t-1}} \right\rangle_{\mathrm{batch}} \\
m_{t+1} &= \kappa m_t - \varepsilon \nabla f(\theta_t + \kappa m_t) \\
\theta_{t+1} &= \theta_t + m_{t+1}
\end{aligned} \tag{5}$$

where $\nabla f(\theta_t)$ is the gradient of the log-likelihood cost function with respect to the parameter $\theta_t$ averaged across the mini-batch, $m_t$ is the momentum, $\kappa$ is the momentum coefficient, and $\varepsilon$ is the learning rate, which is initialized at 0.07 and reduced to 1/10 of its value every 6500 iterations if the cost function did not improve by more than 10% [31]. Weight decay usually led to better generalization for sign language from different subjects. In the experiments described in Section 5, we observed that NAG converges faster than stochastic gradient descent with momentum only.

4.5. Regularization and dropout

To reduce over-fitting, we applied regularization, which is regarded as a key component of successful generalization of the network. During training, the joint learning objective can be formulated as follows:

$$L = L_0 + \frac{\lambda}{2n}\sum_{w} w^2 \tag{6}$$

where the first term is the negative log-likelihood described in Equation (4) and the second term is the L2 regularization. Optimizing the joint learning objective can be viewed as a compromise between minimizing the original cost function and finding small weights, with the balance determined by the value of $\lambda$. We also used the dropout technique in our model, which has been very successful in improving the performance of deep neural networks [32]. During dropout, the outputs of the fully connected hidden layer were randomly set to 0 with probability $\rho=0.5$, and the network learns a set of weights and biases by repeating this process; consequently, half of the neurons are removed from the back-propagation step of each training iteration. To compensate for that, the weights of the layer following the dropped layer were multiplied by 2 in the forward propagation stage.

4.6.
Multi-modal fusion CNNs have been shown to be more effective while combining data from different modalities [33]. Assume that the weights of the two sub-networks are pre-trained, the next important issue is to determine a fusion strategy. Our fusion model combines the class-membership probabilities estimated from the separate single-modality networks by linear combination to compute the final probabilities for the sign language classifier:   P(C∣x)∝α*P(C∣xI,ΘI)+(1−α)*P(C∣xC,ΘC) (7) Here, the different probabilities are provided by the sub-networks described in Section 4.2, α represents the coefficient to control the contributions of each stream and its value is optimized through cross-validation. Usually, the best value of a is very close to 0.5, indicating that both separate networks with different input modalities are equally important. Following this strategy, we combine the two results together at the output layer of our model and finally predict the class label as: c*=argmaxP(C∣x) . 4.7. GPU acceleration A potential concern of training deep neural networks is time consuming, it usually costs several weeks or months to train a CNN with million-scale in large-scale video dataset. Fortunately, it is still possible by using a GPU-based calculation with the help of CUDA-Convnet for parallel processing to achieve real-time efficiency. We train our sign language classifier with Python libraries Theano [34], and PyLearn2 [35] for the fast implementation. Experiments were conducted on one modern machine with an eight-core processor (Intel Core i7-6700K), 32GB SDRAM and a NVIDIA GeForce GTX 1070 GPU with 11264MB of memory. The training speed our 3D-CNN model was three to five times faster than before by using GPU, and the training time per epoch is about 2100 s, this allows us to complete the training in 3 days. 5. EXPERIMENT AND RESULTS This section reports how our strategies work by illustrating groups of experiments. 
First, the SLVM and ChaLearn Looking at People Gesture datasets used in our experiments are introduced briefly; then we present the experimental protocol we followed and a comparison with previous methods.

5.1. Experiments on SLVM dataset

As mentioned in Section 3.1, most of the existing datasets lack effective and accurate labeling or are stored in a single data format. Given these limitations, we built a publicly available sign language dataset ourselves. SLVM contains a vocabulary of 20 signs that are widely used during a visit to the museum. Each sign is performed by 17 different users, and every signer repeats each word 20 times; thus each gesture has 340 samples, and there are 6800 samples in total. The infrared image and the contour image were recorded simultaneously via the Kinect sensor.

The experiments were performed using a 7-fold cross-validation approach to evaluate the various settings of the proposed framework. All the samples are divided into three mutually exclusive subsets (5100 training, 850 validation and 850 test instances), and no sample performed by the same person appears in both the training and testing subsets. In the training phase, the training and validation datasets were used. In order to maximize the generalization performance of the framework, some form of model selection [36] is required to pick optimal parameters based on the accuracy rate on the validation set. The hyper-parameters are $\vartheta = (\rho, \lambda, \varepsilon)$, where $\rho$ is the dropout probability, $\lambda$ is the regularization parameter controlling the bias-variance trade-off described in Section 4.5 and $\varepsilon$ is the learning rate. We train the proposed deep architecture from scratch; therefore, higher learning rates are needed [1]. The initial learning rate $\varepsilon$ is set to 0.07 and dropped to 1/10 of its value every 6500 iterations if the cost function has not improved by more than 1/10.
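The learning-rate schedule just described can be sketched as follows. This is an illustrative reading rather than the original training code: the function name `step_learning_rate` is hypothetical, and measuring the improvement as the relative change in cost since the previous checkpoint is an assumption, since the text does not state exactly how the improvement is computed.

```python
def step_learning_rate(lr, iteration, prev_cost, curr_cost,
                       check_every=6500, min_rel_improvement=0.10, factor=0.1):
    """Return the learning rate to use after the given iteration.

    prev_cost is the cost at the previous checkpoint and curr_cost the cost
    now; comparing these two values is an assumption of this sketch.
    """
    if iteration == 0 or iteration % check_every != 0:
        return lr  # not at a checkpoint: keep the current rate
    if prev_cost > 0 and (prev_cost - curr_cost) / prev_cost > min_rel_improvement:
        return lr  # cost improved by more than 10%: keep the current rate
    return lr * factor  # otherwise drop the rate to one tenth
```

For example, starting from 0.07, a cost that falls only from 1.0 to 0.95 by iteration 6500 would reduce the rate to about 0.007, whereas a fall to 0.80 would leave it unchanged.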
We then used the remaining samples as the testing data and calculated the accuracy as the final fitness value; these sequences were held out for testing while all other sequences were used for training and validation. The classification accuracies of the training process are shown in Fig. 4. 98 epochs are sufficient to reach the lowest test error, where the best training and prediction accuracies are 95.8% and 89.2%, respectively.

Figure 4. Classification accuracies of our method on SLVM dataset.

Our most notable experiments are the models with ReLUs and LCN. Furthermore, dropout is used as the main approach to reduce over-fitting. As shown in Table 1, neural networks with ReLU non-linearities achieve an improvement of 5.5% with respect to Tanh units, possibly because the ReLU network proved faster to train than one with standard Tanh units. LCN, added in the first two layers of the proposed deep architecture, also obtains more than 2% improvement in single modalities. Dropout is a technique that provides a way of approximately combining exponentially many different networks efficiently and prevents over-fitting [32]. The comparisons show that it typically improves the performance of the proposed deep architectures by at least 2.4%, even when no data augmentation is used.

Miao et al. [23] use three different modalities and combine the features extracted from them to boost the final performance to a large extent. This research motivated our work to employ a fusion scheme for a higher recognition rate. As can be observed from both the performance measures and the accuracy rate, the infrared modality on its own (accuracy = 88.3%) usually performs better than the contour modality (accuracy = 87.6%).
This result makes sense because the information in the infrared data is more robust to indoor lighting changes and the background noise can be excluded more easily. By contrast, the contour images tend to ignore details of the hand-shape. Another interesting conclusion from Table 1 is that using the fusion scheme described in Section 4.6 to combine the infrared images with the contour modality usually performs better than using infrared video only (the accuracy improved to 89.2%). One possible explanation is that some sign language words have similar trajectories, but the details of the hand-shape and the motion ranges of the upper limb are quite different. Therefore, combining both features is a better choice.

In order to better understand the performance of the model, we show the confusion matrix in Fig. 5. It tells us which words caused confusion and displays the fraction of true positives for each sign language word on the diagonal. We observed that almost all signs are confused with others to some degree, while some signs are easier to recognize than others. For example, ‘follow me’ is expressed with two hands, and the motion trajectory of the arms is very different from that of many other gestures.

Figure 5. Confusion matrices for our method on SLVM dataset.

At the same time, the proposed 3D-CNN extracts features from 32 consecutive frames. Therefore, if the model cannot recognize the class of an input, its predicted results will be uncertain, because some gestures are difficult to distinguish from others when the trajectories of the upper limb overlap.

5.2. Experiments on ChaLearn dataset

To validate our proposed sign language recognition algorithm, we evaluated our method on the benchmark dataset of ChaLearn LAP [13], which was made public for the gesture classification challenge in 2014.
The focus is on user-independent spotting with multiple samples of each gesture, which means learning to detect and recognize gestures from multi-modal instances of each category performed by different people. As illustrated in Table 2, this dataset contains 13 858 RGB-D videos of 20 upper-body Italian conversational sign language gestures performed by 27 different users with variations in surroundings, clothing and lighting. The videos were recorded with a consumer RGB-D sensor.

Table 2. ChaLearn 2014 Gesture Dataset Statistics.

| Subsets | Labeled instances | Length (min) |
| --- | --- | --- |
| Training | 7754 | 470 |
| Validation | 3362 | 230 |
| Testing | 2742 | 240 |

Several measures can be used to evaluate the performance of architectures. In this work, we follow the ChaLearn LAP 2014 Challenge score to measure the gesture recognition performance. The competition score is based on the Jaccard Index, which is defined as follows:

$$J(s,n) = \frac{|A(s,n) \cap B(s,n)|}{|A(s,n) \cup B(s,n)|} \tag{8}$$

where $A(s,n)$ is the ground truth label of gesture $n$ in sequence $s$, and $B(s,n)$ is the prediction output by the algorithm for that gesture in sequence $s$. The Jaccard Index $J(s,n)$ can be seen as the overlap rate between $A(s,n)$ and $B(s,n)$. To compute the final score, the Jaccard Index is averaged over all gesture classes and all sequences:

$$J_{\text{mean}} = \frac{1}{NS} \sum_{s=1}^{S} \sum_{n=1}^{N} J(s,n) \tag{9}$$

where $N = 20$ is the number of categories and $S$ is the number of sequences in the test set. We use this mean Jaccard Index as the final evaluation criterion.

Following the algorithm proposed above, we use the RGB maps and depth maps to train our sub-networks separately.
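The Jaccard Index of Equations (8) and (9) can be illustrated with a short sketch. It assumes that $A(s,n)$ and $B(s,n)$ are represented as sets of frame indices, and that a gesture absent from both the ground truth and the prediction contributes 0; both are assumptions of this example, and the official challenge script may handle such cases differently. The names `jaccard_index` and `mean_jaccard` are hypothetical.

```python
def jaccard_index(truth_frames, predicted_frames):
    """Overlap |A ∩ B| / |A ∪ B| between two collections of frame indices."""
    a, b = set(truth_frames), set(predicted_frames)
    union = a | b
    if not union:
        return 0.0  # assumption: an absent gesture contributes zero
    return len(a & b) / len(union)


def mean_jaccard(truth, prediction, num_classes=20):
    """Average J(s, n) over all sequences s and gesture classes n.

    truth, prediction -- one dict per sequence, mapping a class index to the
    frame indices labeled (or predicted) as that gesture.
    """
    if not truth or len(truth) != len(prediction):
        raise ValueError("truth and prediction must be non-empty and equally long")
    total = 0.0
    for truth_seq, pred_seq in zip(truth, prediction):
        for n in range(num_classes):
            total += jaccard_index(truth_seq.get(n, ()), pred_seq.get(n, ()))
    return total / (num_classes * len(truth))
```

For instance, a gesture labeled on frames 0 to 9 and predicted on frames 5 to 14 shares 5 frames out of a union of 15, giving a Jaccard Index of 1/3.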
Two different training strategies are used to evaluate the proposed architecture.

Strategy 1: We first train the networks on the ChaLearn 2014 dataset from scratch. The learning rate is initialized at 0.05 and dropped to 1/10 of its value every 9700 iterations [1]. Rectified Linear Units, dropout and regularization were used to improve the performance of our network. The final accuracy on the test set is 92.4% and the mean Jaccard Index score is 0.793, meaning that our method can be used to accurately classify different gestures of dynamic sign language.

Strategy 2: We fine-tune the networks for the ChaLearn 2014 dataset based on the models pre-trained on SLVM. The learning rate is initialized at 0.01 and reduced by a factor of 10 after every 9700 iterations. As illustrated in Fig. 6, we find that fine-tuning the networks from pre-trained models has a significant impact on generalization. The confusion matrix for the ChaLearn 2014 dataset (Fig. 7) reveals that the recognition rates across gestures are generally kept at a mean accuracy of 94.5±1.77%, although a single gesture, Buonissimo, falls below a 90% recognition rate. We observe a validation accuracy of 96.3% for our best model and improve the mean Jaccard Index to 0.836. The results of the experiment show that fine-tuning from pre-trained models achieves better recognition rates than training from scratch. Thus, multiple modalities can also be viewed as an effective data augmentation method to prevent over-fitting on relatively small datasets.

Figure 6. Classification accuracies of our method on ChaLearn 2014 dataset.

Figure 7. Confusion matrices for our method on ChaLearn 2014 dataset.
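As an illustration of how per-gesture recognition rates and a summary of the form mean ± deviation can be read from a confusion matrix, consider the following sketch. It assumes that rows index the true class and columns the predicted class, and it uses the population standard deviation; neither convention is stated in the text, and the names and toy counts are hypothetical.

```python
import statistics


def per_class_recognition_rates(confusion):
    """Fraction of correctly recognized samples for each true class.

    confusion[i][j] counts samples of true class i predicted as class j
    (rows as true classes is an assumption of this sketch).
    """
    rates = []
    for i, row in enumerate(confusion):
        total = sum(row)
        rates.append(row[i] / total if total else 0.0)
    return rates


def summarize(rates):
    """Mean and population standard deviation of the per-class rates."""
    return statistics.mean(rates), statistics.pstdev(rates)


# Toy example (hypothetical counts): two classes, ten samples each.
toy_confusion = [[9, 1], [2, 8]]
toy_rates = per_class_recognition_rates(toy_confusion)
toy_mean, toy_std = summarize(toy_rates)
```

The diagonal entries divided by the row sums give the fraction of true positives for each gesture, which is the quantity displayed on the diagonal of the confusion matrices.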
A comparison of our results with previous work from the challenge is presented in Table 3. The purpose of this comparison is to explore the relative strengths and weaknesses of different learning representations as well as the nuances of multi-modal fusion. The first three works [37–39] in the table use hand-crafted feature representations that are subsequently classified. The comparison is consistent with the view that a deep neural network can extract high-level features from the raw video stream by building a hierarchical architecture.

Table 3. Comparison of results on the ChaLearn Gesture dataset.

| Method | Modalities used | Jaccard |
| --- | --- | --- |
| Random Forest [37] | RGB | 0.746 |
| MRF [38] | Skeleton+RGB | 0.826 |
| Boosted classifier [39] | Skeleton+Depth+RGB | 0.833 |
| 3D CNN [28] | RGB-D | 0.791 |
| Dynamic DNN [27] | Skeleton | 0.863 |
| | RGB-D | 0.787 |
| | Multi-modal Fusion | 0.879 |
| DNN [25] | Depth+RGB+Audio | 0.881 |
| Proposed: Trained from scratch | RGB+Skeleton | 0.793 |
| | Depth+Skeleton | 0.807 |
| | Multi-modal Fusion | 0.815 |
| Proposed: Fine-tuned on the SLVM | RGB+Skeleton | 0.817 |
| | Depth+Skeleton | 0.829 |
| | Multi-modal Fusion | 0.836 |
The study of [28] was based on a 3D-ConvNet that learned spatio-temporal features from RGB-D streams. Compared with this 3D-CNN framework, our model is deeper and iteratively integrates discriminative data representations from multi-modal data. The study of [27] utilizes a DBN and a 3D-CNN for learning contextual frame-level representations. We also note that it is the only method that incorporates more structured temporal modeling, and this is an excellent framework because an HMM-like approach can be suitable for variable-length temporal information fusion. The study of [25] fused multiple modalities at several spatial and temporal scales to ensure the robustness of the deep architecture to missing signals in one or several channels, and this work achieves the best mean Jaccard Index score in the competition. It is worth noting that a deep neural architecture combining multi-scale features usually achieves better performance.

6. CONCLUSION AND DISCUSSION

Dynamic sign language recognition involves anticipated challenges such as temporal variance, spatial complexity and movement epenthesis. Therefore, extracting spatio-temporal features with 3D CNNs becomes the key to effective dynamic sign language recognition methods. In this paper, we have presented a 3D-CNN model for continuous dynamic sign language classification on multi-modal data consisting of infrared data, contour data and skeleton data. Without any prior knowledge, our 3D-CNN model can automatically learn spatio-temporal motion information and use it to recognize entire dynamic signs. Since our recorded dataset is composed of three modalities, our model integrates distinct strategies for the different data: (1) Skeleton data used to track the trajectory of the upper limbs further improve the accuracy of sign language recognition. (2) In contrast to previous mainstream methods that use three-channel RGB image data, we use single-channel infrared data to improve computational efficiency. (3) Synchronous contour data are used to compensate for classification errors from the infrared data. On top of that, we used a late fusion strategy to combine the two classification results from the sub-networks. In order to involve more data for training, fine-tuning the architectures from a pre-trained model is also used as another practical technique to prevent over-fitting. We evaluated our model on a new dataset of dynamic sign language and against other benchmarks, and the experiments show that our proposed method achieves competitive results.

There are several directions for future work.
For the task of capturing temporal information in video sequences, a simple temporal convolution strategy is not sufficient for dynamic sign language recognition. We will investigate the possibility of building a unified model that makes better use of the temporal component of the problem, in which the temporal convolutions are directly connected to a long short-term memory (LSTM) sequence classifier. Furthermore, the SLVM dataset contains 6800 samples in total; in order to further prevent over-fitting and to improve the generalization performance of the classifier, performing offline and online spatio-temporal data augmentation will be another direction for our future work.

Funding

This work was supported by the National Key Technology Research Program of the Ministry of Science and Technology of China [grant number: 2015BAK33B02]; the National Natural Science Foundation of China [grant number: 61671483]; and the Continuing Education Research Foundation of Southwest University of Science and Technology [grant number: 17JYF01].

Appendix A

| Symbol | Meaning |
| --- | --- |
| $a$ | A scalar (integer or real) |
| $\mathbf{a}$ | A vector |
| $\mathbf{A}$ | A matrix |
| $\sum_{i=1}^{n} a_i$ | The sum from $i=1$ to $n$ of $a_i$ |
| $\rho$ | Dropout probability |
| $\lambda$ | Regularization parameter |
| $\varepsilon$ | Learning rate |

References

1 Molchanov, P., Gupta, S., Kim, K. and Kautz, J. (2015) Hand Gesture Recognition with 3d Convolutional Neural Networks. Proc. IEEE Conf. Computer Vision and Pattern Recognition Workshops, Boston, USA, June 7–12, pp. 1–7. IEEE, New Jersey, USA.
2 Willems, G., Tuytelaars, T. and Gool, L.V. (2008) An Efficient Dense and Scale-Invariant Spatio-temporal Interest Point Detector. Computer Vision-ECCV 2008, Marseille, France, October 12–18, 2008, pp. 650–663. Springer, Berlin, Heidelberg.
3 Dollár, P., Rabaud, V., Cottrell, G. and Belongie, S. (2005) Behavior Recognition via Sparse Spatio-temporal Features. Visual Surveillance and Performance Evaluation of Tracking and Surveillance, Beijing, China, October 15–16, 2005, pp. 65–72. IEEE, New Jersey, USA.
4 Wang, S.B., Quattoni, A., Morency, L.P., Demirdjian, D. and Darrell, T. (2006) Hidden Conditional Random Fields for Gesture Recognition. IEEE Comput. Soc. Conf. Computer Vision and Pattern Recognition, New York, USA, June 17–22, 2006, pp. 1521–1527. IEEE, New Jersey, USA.
5 Starner, T., Weaver, J. and Pentland, A. (1998) Real-time American sign language recognition using desk and wearable computer based video. IEEE Trans. Pattern Anal. Mach. Intell., 20, 1371–1375.
6 Dardas, N.H. and Georganas, N.D. (2011) Real-time hand gesture detection and recognition using bag-of-features and support vector machine techniques. IEEE Trans. Instrum. Meas., 60, 3592–3607.
7 Krizhevsky, A., Sutskever, I. and Hinton, G.E. (2012) Imagenet Classification with Deep Convolutional Neural Networks. Int. Conf. Neural Information Processing Systems, Doha, Qatar, November 12–15, 2012, pp. 1097–1105. Springer, Berlin, Heidelberg.
8 Ji, S., Xu, W., Yang, M. and Yu, K. (2013) 3d convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell., 35, 221–231.
9 Lecun, Y., Huang, F.J. and Bottou, L. (2004) Learning Methods for Generic Object Recognition with Invariance to Pose and Lighting. Computer Vision and Pattern Recognition, 2004. CVPR 2004. Washington, DC, USA, June 27–July 2, 2004, pp. II-97–104, Vol. 2. IEEE, New Jersey, USA.
10 Ning, F., Delhomme, D., Lecun, Y., Piano, F., Bottou, L. and Barbano, P.E. (2005) Toward automatic phenotyping of developing embryos from videos. IEEE Trans. Image Process., 14, 1360–1371.
11 Simonyan, K. and Zisserman, A. (2014) Two-stream convolutional networks for action recognition in videos. Adv. Neural Inf. Process. Syst., 1, 568–576.
12 Ohn-Bar, E. and Trivedi, M.M. (2014) Hand gesture recognition in real time for automotive interfaces: a multimodal vision-based approach and evaluations. IEEE Trans. Intell. Transport. Syst., 15, 2368–2377.
13 Escalera, S.A. (2014) Chalearn Looking at People Challenge 2014: Dataset and Results. Workshop Eur. Conf. Computer Vision, Zurich, Switzerland, September 6–12, 2014, pp. 459–473. Springer, Berlin, Heidelberg.
14 Shu, Z., Yun, K. and Samaras, D. (2014) Action Detection with Improved Dense Trajectories and Sliding Window. Eur. Conf. Computer Vision, Zurich, Switzerland, September 6–12, 2014, pp. 541–551. Springer, Berlin, Heidelberg.
15 Ming, J.C., Omar, Z. and Jaward, M.H. (2017) A review of hand gesture and sign language recognition techniques. Int. J. Mach. Learn. Cybern., 1, 1–23.
16 Zaki, M.M. and Shaheen, S.I. (2011) Sign language recognition using a combination of new vision based features. Pattern Recognit. Lett., 32, 572–577.
17 Huang, C.L., Huang, W.Y. and Lien, C.C. (1995) Sign Language Recognition Using 3-d Hopfield Neural Network. Int. Conf. Image Processing, 1995. Proceedings, Washington, USA, October 23–26, 1995, pp. 611–614, Vol. 2. IEEE, New Jersey, USA.
18 Murakami, K. and Taguchi, H. (1991) Gesture Recognition Using Recurrent Neural Networks. Conf. Human Factors in Computing Systems, New Orleans, LA, USA, April 27–May 2, 1991, pp. 237–242. ACM, New York, USA.
19 Jong-Sung Kim, W.J. and Bien, Z. (1996) A dynamic gesture recognition system for the Korean sign language (ksl). IEEE Trans. Syst. Man Cybern. B, 26, 354–359.
20 Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R. and Fei-Fei, L. (2014) Large-Scale Video Classification with Convolutional Neural Networks. IEEE Conf. Computer Vision and Pattern Recognition, Columbus, OH, USA, June 24–27, 2014, pp. 1725–1732. IEEE, New Jersey, USA.
21 Tran, D., Bourdev, L., Fergus, R., Torresani, L. and Paluri, M. (2014) Learning spatiotemporal features with 3d convolutional networks. pp. 4489–4497.
22 Zhu, G. and Zhang, L. (2017) Large-Scale Isolated Gesture Recognition Using Pyramidal 3d Convolutional Networks. Int. Conf. Pattern Recognition, Dhaka, Bangladesh, February 13–14, 2017, pp. 19–24. IEEE, New Jersey, USA.
23 Li, Y. and Miao, Q. (2017) Large-Scale Gesture Recognition with a Fusion of rgb-d Data based on the c3d Model. Int. Conf. Pattern Recognition, Dhaka, Bangladesh, February 13–14, 2017, pp. 25–30. IEEE, New Jersey, USA.
24 Duan, J., Zhou, S., Wan, J., Guo, X. and Li, S.Z. (2016) Multi-modality fusion based on consensus-voting and 3d convolution for isolated gesture recognition. pp. 44–49.
25 Neverova, N., Wolf, C., Taylor, G. and Nebout, F. (2014) Moddrop: adaptive multi-modal gesture recognition. IEEE Trans. Pattern Anal. Mach. Intell., 38, 1692–1706.
26 Wu, D. and Shao, L. (2014) Deep Dynamic Neural Networks for Gesture Segmentation and Recognition. Computer Vision - ECCV 2014 Workshops, Zurich, Switzerland, September 6–7 and 12, 2014, pp. 552–571. Springer International Publishing.
27 Wu, D. and Pigou, L. (2016) Deep dynamic neural networks for multimodal gesture segmentation and recognition. IEEE Trans. Pattern Anal. Mach. Intell., 38, 1583.
28 Pigou, L., Dieleman, S., Kindermans, P.J. and Schrauwen, B. (2014) Sign Language Recognition Using Convolutional Neural Networks. Computer Vision - ECCV 2014 Workshops, Zurich, Switzerland, September 6–7 and 12, 2014, pp. 572–578. Springer International Publishing.
29 Jarrett, K., Kavukcuoglu, K., Ranzato, M. and Lecun, Y. (2010) What is the Best Multi-stage Architecture for Object Recognition? IEEE Int. Conf. Computer Vision, Kyoto, Japan, September 27–October 4, 2009, pp. 2146–2153. IEEE, New Jersey, USA.
30 Sutskever, I., Martens, J., Dahl, G. and Hinton, G. (2013) On the Importance of Initialization and Momentum in Deep Learning. Int. Conf. Machine Learning, Atlanta, USA, June 16–21, 2013, pp. III-1139. ACM, New York, USA.
31 Zhu, G., Zhang, L., Shen, P. and Song, J. (2017) Multimodal gesture recognition using 3d convolution and convolutional lstm. pp. 4517–4523.
32 Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I. and Salakhutdinov, R.R. (2012) Improving neural networks by preventing co-adaptation of feature detectors. Comput. Sci., 3, 212–222.
33 Molchanov, P., Gupta, S., Kim, K. and Pulli, K. (2015) Multi-sensor System for Driver’s Hand-Gesture Recognition. IEEE Int. Conf. Workshops on Automatic Face and Gesture Recognition, Ljubljana, Slovenia, May 4–8, 2015, pp. 1–8. IEEE, New Jersey, USA.
34 Bastien, F. et al. (2012) Theano: new features and speed improvements. Comput. Sci., 11, 1–10.
35 Bastien, F. et al. (2015) Blocks and fuel: frameworks for deep learning. Comput. Sci., 6, 1–5.
36 Rivasperea, P., Cotaruiz, J., Venzor, J.A.P., Chaparro, D.G. and Rosiles, J.G. (2013) Lp-svr model selection using an inexact globalized quasi-Newton strategy. J. Intell. Learn. Syst. Appl., 5, 19–28.
37 Cihan, N., Kindiroglu, A.A. and Akarun, L. (2014) Gesture Recognition Using Template Based Random Forest Classifiers. Computer Vision - ECCV 2014 Workshops, Zurich, Switzerland, September 6–7 and 12, 2014, pp. 579–594. Springer International Publishing.
38 Chang, J.Y. (2014) Nonparametric Gesture Labeling from Multi-modal Data. Computer Vision - ECCV 2014 Workshops, Zurich, Switzerland, September 6–7 and 12, 2014, pp. 503–517. Springer International Publishing.
39 Monnier, C., German, S. and Ost, A. (2014) A Multi-scale Boosted Detector for Efficient and Robust Gesture Recognition. Computer Vision - ECCV 2014 Workshops, Zurich, Switzerland, September 6–7 and 12, 2014, pp. 491–502. Springer International Publishing.

Footnotes

1 The dataset can be downloaded at: https://pan.baidu.com/s/1pL2qwuZ
2 Available at http://pan.baidu.com/s/1dEX29R7

Author notes

Handling editor: Yannis Manolopoulos

© The British Computer Society 2018. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com
This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices)

Journal

The Computer Journal, Oxford University Press

Published: May 14, 2018
