TY - JOUR
AU1 - Bulbul, Mohammad Farhad
AU2 - Islam, Saiful
AU3 - Zhou, Yatong
AU4 - Ali, Hazrat
AB - This paper presents a simple, fast and effective system for improving human action classification from depth action sequences. First, motion history images (MHIs) and static history images (SHIs) are created from the front (XOY), side (YOZ) and top (XOZ) projected scenes of each depth sequence in a 3D Euclidean space by employing the 3D Motion Trail Model (3DMTM). The Local Binary Patterns (LBPs) algorithm is then applied to the MHIs and SHIs to compute motion and static hierarchical features that represent the action sequence. The motion and static hierarchical feature vectors are fed into a classifier ensemble to classify action classes, where each ensemble comprises two classifiers of the same type: a pair of Kernel-based Extreme Learning Machines (KELMs), $l_2$-regularized Collaborative Representation Classifiers ($l_2$-CRCs) or Multi-class Support Vector Machines (MSVMs). To assess the framework extensively, we perform experiments on three standard publicly available datasets: MSR-Action3D, UTD-MHAD and DHA. Experimental results demonstrate that the proposed approach achieves state-of-the-art recognition performance in comparison with other available approaches. Several statistical measures on the recognition results also indicate that the method performs best when the hierarchical features are used with the KELM ensemble. In addition, to assess the real-time processing capability of the algorithm, the running time of its major components is investigated. Since running time is machine dependent, the computational complexity of the system is also reported and compared with other methods. The experimental results and the evaluation of computational time and complexity reflect the real-time compatibility and feasibility of the proposed system.

1. Introduction

Human action categorization is an active research problem. Human action recognition (HAR) has broad applications in daily life; surveillance systems, content-based video search, health care, video analysis, physical rehabilitation, robotics and human–computer interaction are among the most prominent [1–7]. HAR is also widely applied to detect hazardous events or to monitor a disabled person living alone. Fall detection is another important application of HAR, which is particularly helpful for patient monitoring systems (e.g. [8]). Despite the many attempts reported for HAR, it remains a challenging task. The main challenges are associated with cluttered backgrounds, varying illumination conditions, occlusion, stationary or moving cameras, variation in human appearance, clothing and shape, inter-class similarity and intra-class variation.

Previously, action recognition was accomplished using conventional RGB/color-image sensors. However, these sensors suffer under occlusion, illumination changes and background clutter [9]. The introduction of suitable depth video cameras (e.g. Microsoft Kinect) has facilitated research progress on action recognition and helped address the aforementioned shortcomings of RGB sensors. The depth sensor captures depth data and RGB data simultaneously. Compared to color data, depth data are insensitive to illumination conditions and to variations in the actors' textures [10].
Depth data yield the 3D body shape, structure and motion information of an individual in depth images. Moreover, foreground segmentation and extraction are considerably simpler in depth data than in RGB data [11]. Furthermore, the human skeleton can be estimated from depth frames in a straightforward manner, which further assists recognition system development [12, 13].

Main contributions: From the above discussion, it is clear that depth sensor-oriented action identification methods can outperform RGB sensor-based methods. For this reason, in this work, we address the action classification problem using depth sensor data. More precisely, we employ the 3D Motion Trail Model (3DMTM) on each depth action video to form the motion history images (MHIs) and static history images (SHIs). The images are extracted along three 2D projection views, namely the front (XOY), side (YOZ) and top (XOZ) views in a 3D Euclidean space, for each depth sequence. Then, the Local Binary Patterns (LBPs) algorithm is applied to the MHIs and SHIs to extract features describing the motion and the static posture of the subject. The MHI- and SHI-based texture feature extraction technique follows a hierarchical strategy, and the features are therefore referred to as motion hierarchical features (MHFs) and static hierarchical features (SHFs). The MHF and SHF sets are then individually passed to a classifier ensemble, where the ensemble consists of either two Kernel-based Extreme Learning Machine (KELM) algorithms, two $l_2$-Regularized Collaborative Representation Classifiers ($l_2$-CRCs) or two Multi-class Support Vector Machine (MSVM) classifiers, to obtain recognition outcomes for both feature sets. We propose the use of the Logarithmic Opinion Pool (LOGP) to fuse the recognition outcomes and obtain the class label of the query sample. The recognition results of the different classifier ensembles are compared and analyzed through several statistical tools. To evaluate our system, all experiments are conducted on the MSR-Action3D [14], UTD-MHAD [15] and DHA [16] datasets.

In our work, we employ the well-known and frequently used LBP, KELM, $l_2$-CRC, MSVM and LOGP algorithms with their optimal implementation strategy. Specifically, we concentrate on the optimal feature representation strategy after feature extraction by the LBP method. The optimal feature representation approach is discussed in Section 3.1.3 and shown in Fig. 4. In fact, the feature representation plays a key role in achieving state-of-the-art performance in comparison with existing methods, as well as in surpassing the performance of methods that employ similar classification techniques (see the discussion in Section 4.2). The methods that used similar strategies captured features from motion images only, along the three projection views, and represented them separately through three vectors. In comparison, we incorporate the three motion feature vectors into a single vector, the MHF, to represent the motion information of actors. Furthermore, our system combines the three feature vectors (for the three projection views) obtained from motionless images into the SHF vector to describe the static information of actors. Overall, unlike other methods, we formulate a new feature representation technique to describe the motion and static action cues and use this new feature space for the action classification task.
More discussion on this context is given at the end of Section 2. We summarize the key contributions of our work as follows:

We formulate a new feature representation technique for describing the motion and static action information. For each depth action video, we construct an MHF vector and an SHF vector. The MHF vector represents an action through motion information, and the SHF describes an action through the motionless information corresponding to an actor's body parts. The motivation for using both feature sets is that the SHFs also contain information significant for recognition. The higher recognition accuracy of our method over motion image-based methods (e.g. Depth Motion Map (DMM)/MHI-based handcrafted and deep learning methods) supports the idea that static image-based action features should be captured in addition to features from motion images.

Based on three publicly available datasets, MSR-Action3D [14], UTD-MHAD [15] and DHA [16], the introduced approach is assessed and compared with handcrafted and deep learning approaches. The extensive comparison indicates the superiority of the introduced system over both handcrafted and deep learning systems.

The computational efficiency of our method is examined by measuring the running time and computational complexity. The computational time and complexity values, as presented in Tables 7 and 8, indicate that the proposed approach is feasible for real-time use.

The remainder of the paper is organized as follows. Section 2 discusses depth data-related action recognition systems. The proposed method is presented in detail in Section 3. Comprehensive experimental evaluations and discussions are given in Section 4. The computational efficiency of the proposed framework is described in Section 5. Finally, the conclusion and limitations of this work are reported in Section 6 along with our future research plan.

2. Related Work

Our approach addresses the action categorization task on depth action video clips. The existing approaches based on depth data can be categorized into four groups, namely the depth data-based approach, the depth data-based skeleton approach, the depth and other data features fusion-based approach and the depth data-based deep learning approach. The work by Chen et al. [17] provides a more detailed review of previous approaches.

Depth data-based approach: Researchers in computer vision have addressed the task of activity recognition by representing actions through straightforward features from depth map sequences/depth videos. For example, Li et al. [14] gathered 3D points from each depth image of a depth video. Vieira et al. [18] utilized a technique called STOP to capture the spatial and temporal information of a depth action. Wang et al. [19] used another strategy, named Random Occupancy Pattern (ROP), to characterize human actions. Building on the concepts of motion energy images (MEIs) [20] as well as MHIs, DMMs were constructed to represent depth actions in [21]. That method also employed Histograms of Oriented Gradients (HOGs) on DMMs to describe actions more compactly. Later, Chen et al. [22] addressed a few limitations of the method and improved it considerably. To develop a novel feature-based recognition system, DCSF features were proposed to describe the 3D depth cuboids of an action [23]. Beyond capturing shape and motion cues individually, Oreifej and Liu [24] explored integrated shape-motion cues by introducing the Histogram of Oriented 4D Normals (HON4D) strategy.
In the two years following the simplified DMM of [22], researchers introduced several enhanced recognition systems. Those systems basically extracted texture and shape features. For instance, the LBP [25] algorithm was applied to DMMs to represent depth human action sequences [26]. Bulbul et al. also developed DMM-based action classification systems in [27–29]. A hybrid action classification framework was described in [30] by combining SDM and BSM feature vectors. As a novel approach, the Histogram of Oriented 3D Gradients (HOG3D) was computed on the block cells of depth sub-sequences to capture the local spatio-temporal information of an action [31]. Furthermore, Kong et al. [32] employed hierarchical 3D kernel descriptors in their recognition framework. To improve the DMM-based approach of [22], a depth map sequence was segmented into several segments and DMMs were computed for each segment [33]; LBP texture features were then extracted from the DMMs. In addition, the system was further improved by extracting auto-correlation features locally on DMMs [34, 35]. Zhang et al. [36] computed 3DHoT features to propose another novel recognition method.

Depth data-based skeleton approach: Researchers have mapped depth videos into skeleton videos. In this way, the skeleton as well as the skeleton joints can be accessed to describe an action compactly. Human skeleton-oriented approaches are further categorized into skeleton joint-based and body part-based approaches. Body part-based systems can be developed easily on RGB videos (e.g. [37]), but skeleton joint-based methods are difficult to realize on RGB action data. In comparison, developing skeleton joint-based approaches is considerably easier for depth action videos than for RGB videos. Overall, skeleton-based methods, including both skeleton joint-based and body part-based methods, can be developed naturally from depth data. For example, Fengjun et al. [38] introduced a skeleton joint-based action recognition system by decomposing the high-dimensional 3D joint space into a set of feature spaces, where each feature was associated with the motion of a single joint or a combination of multiple joints. Besides, Xia et al. [39] proposed HOJ3D features to characterize multiple human actions. As another approach, Azary et al. [40] applied sparse representations to construct scale- and position-invariant features, i.e. spatial–temporal kinematic joint features. Moreover, Yang et al. [41] utilized the locations of human skeleton joints, the temporal displacements of the joints and the offsets of the joints with respect to the initial frame of the human skeleton to represent human actions. The work was then considerably extended in [42]. In the method of Chaudhry et al. [43], the skeleton of an actor was split into several segments following a spatio-temporal hierarchical structure. Here, each skeleton part describes the motion of a group of joints at a specific temporal scale, and actions are characterized by a hierarchical collection of Linear Dynamical Systems (LDSs; an LDS models a skeleton part). In [44], the Moving Pose (MP) descriptor was proposed to develop a simple, fast and powerful recognition system using skeleton data. The descriptor captures the speed and acceleration of the human body joints in addition to the pose information [44]. To represent human actions, 3D joint locations of human skeletons were extracted and the joint trajectories were modeled by employing a temporal hierarchy of covariance descriptors [45].
To improve the skeleton joint-based recognition system and to determine the optimal skeleton-joint subset, an evolutionary algorithm based on a genetic algorithm was developed by Chaaraoui et al. [46]. To develop a body part-based method, Vemulapalli et al. [47] represented a human action as a curve in a Lie group, where each human skeleton of a skeletal action sequence corresponds to a point on the curve. This action representation models the 3D geometric relations among different body parts by utilizing rotations and translations.

Depth and other data features fusion-based approach: To enhance the recognition system, depth data-based features can be combined with RGB and skeleton features. Following this concept, Gao et al. [48] introduced a difference MHI for depth and color action videos. Next, they captured the human motion through multi-perspective projections. The captured human motion was then characterized by employing the pyramid HOG. The multi-perspective and multi-modality descriptors were then combined by a multi-perspective and multi-modality cooperative description and identification model. On the other hand, Zhang et al. utilized 4D local spatio-temporal features to connect the depth and intensity information [49]. The skeleton joints extracted from depth data have also been coupled with features obtained from RGB data. For instance, Luo et al. [50] presented a combination of 3D joint and spatio-temporal features for an RGB action video. Human skeleton joint features are fused with features obtained from depth in [51]. Since fusing depth features with RGB/skeleton features has increased recognition rates significantly, researchers have been motivated to employ features jointly from depth, RGB and skeleton data. As a result, Sung et al. combined HOG features extracted from RGB and depth data with skeleton joint positions in [52]. To further improve the action classification system with a different modality, Chen et al. combined features from depth data and inertial sensor (accelerometer) data in [1, 15].

Depth data-based deep learning approach: Unlike handcrafted feature-based methods, deep learning models learn high-level semantic features from raw data by taking advantage of the multiple layers in the model [53, 54]. In [55], Wang et al. proposed an effective and computationally inexpensive deep model with a small-scale CNN. In [56], the DMM Pyramid was introduced for developing a depth data-oriented deep action recognition model. Keceli et al. [57] extracted high-level deep features from depth sequences by using 3D and 2D CNNs and combined them with an SVM to enhance the recognition system. More precisely, they considered 3D volume representations capturing the temporal action features, and these 3D representations were passed to the 3D CNN to obtain deep features that represent the action. In addition, 2D representations were used to gather deep features from a pre-trained CNN via transfer learning. The 2D and 3D CNN features thus obtained were concatenated to strengthen the proposed system. Azad et al. [58] fed the Weighted Depth Motion Map (WDMM) to a 2D CNN in their recognition system. Zhang et al. [59] described a promising DNN-based method for developing a deep action recognition model. From the above survey of depth image-related action recognition systems, it is evident that DMMs provide a good basis for an action recognition framework.
We aim to develop the recognition system using features extracted from depth data only. Specifically, our approach further improves the DMM-related works. After reviewing the methods described in [21, 22, 26–29, 33–35] and in [36], we point out a potential shortcoming in their feature representation. In those methods, the features were extracted from DMMs obtained from the depth map sequence, which gather the motion information only. However, static information is also important to alleviate the intra-class variability and inter-class similarity issues. This stationary information basically covers the repetitive events and the persistent motionless postures of subjects, and it was completely absent in those methods. On the other hand, the method in [26] proposed decision-level fusion of feature vectors, where the three motion vectors were passed to three KELM algorithms individually. In contrast, we propose that the motion feature vectors should be represented and passed to a classifier as a single vector, since the system then considers the moving body postures along all directions at the same time.

Therefore, in our work, unlike the above systems, we hypothesize that an action sequence should be represented through motion and motionless features simultaneously instead of motion features only. Furthermore, all the motion feature vectors should work in an integrated manner rather than representing an action independently. Similarly, all the static feature vectors should be concatenated into a single feature vector to represent an action across all stationary situations of the actors. In accordance with this concept, we combine the motion feature vectors into a single feature vector, the MHF, and the static feature vectors into a single vector, the SHF. The computed MHF and SHF feature vectors are fed to two KELM algorithms separately, and the two classification decisions are then fused. Notice that the proposed approach adopts decision-level fusion, since a feature-level fusion-based system is computationally expensive and would not necessarily be consistent with real-time implementation.

3. Proposed Recognition Method

Our recognition system mainly consists of three basic stages: hierarchical feature extraction, action classification with multiple classifiers and pooling of the classification outcomes. All of these stages are discussed in detail in the following segments. The flowchart of our proposed method is shown in Fig. 1.

Fig. 1. Flowchart of our proposed system.
Fig. 2. Example of MHIs and SHIs corresponding to the horizontal wave action.
Fig. 3. Example of LBP coded MHI for hand catch action.
Fig. 4. Graphical illustration of MHF and SHF calculation.
Fig. 5. Misclassifications due to inter-class similarity.
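To make the pipeline of Fig. 1 concrete before the individual stages are detailed, the following minimal Python sketch outlines the data flow for one query sample. All helper callables and method names here (compute_mhi_shi, lbp_histograms, decision_scores, logp_fuse) are hypothetical placeholders for the components described in Sections 3.1 and 3.2, not part of the original implementation.

```python
import numpy as np

def recognize_action(depth_frames, compute_mhi_shi, lbp_histograms,
                     clf_motion, clf_static, logp_fuse):
    """Sketch of the pipeline in Fig. 1 for one query depth sequence.

    depth_frames    : list of 2D depth images of one action sample.
    compute_mhi_shi : callable returning three MHIs and three SHIs (Section 3.1.1).
    lbp_histograms  : callable returning the block-wise LBP histogram of one
                      history image (Sections 3.1.2-3.1.3).
    clf_motion/clf_static : classifiers of the same type (KELM, l2-CRC or MSVM),
                      trained on MHF and SHF vectors, respectively.
    logp_fuse       : callable implementing the LOGP decision fusion (Section 3.2).
    """
    mhis, shis = compute_mhi_shi(depth_frames)                  # three views each
    mhf = np.concatenate([lbp_histograms(m) for m in mhis])     # motion hierarchical features
    shf = np.concatenate([lbp_histograms(s) for s in shis])     # static hierarchical features
    scores_m = clf_motion.decision_scores(mhf)                  # per-class outputs
    scores_s = clf_static.decision_scores(shf)
    return logp_fuse(scores_m, scores_s)                        # fused class label
```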
3.1. Feature extraction

In this section, we first describe the construction of the MHIs and SHIs through the 3DMTM algorithm and the LBP texture descriptor. The hierarchical feature extraction strategy is then discussed comprehensively.

3.1.1. Motion and static history image generation

The MHI [20] describes the human motion information by storing all the motion segments of all video frames in a 2D image. The MHI thus obtained does not contain any information about motionless body parts or repetitive movements of an actor. Owing to the lack of this complementary information in the MHI, the action representation becomes inferior. Therefore, to ensure a rich action description, the MHIs and SHIs are computed by employing the 3DMTM [60] on an action video. The calculation scheme of the MHIs and SHIs through the 3DMTM is shown in Fig. 2. The 3DMTM represents human actions in 3D space, from which the MHIs $MHI_{XOY}$, $MHI_{YOZ}$ and $MHI_{XOZ}$ are generated along the front (XOY), side (YOZ) and top (XOZ) 2D projection views, respectively. Similarly, the SHIs $SHI_{XOY}$, $SHI_{YOZ}$ and $SHI_{XOZ}$ are obtained by the 3DMTM along the aforementioned views. In the 3DMTM, along all views, the functions $\varphi_M(x, y, t)$ and $\varphi_S(x, y, t)$ are used to locate the motion and motionless regions in each video frame, respectively. The mathematical formulations of these functions are as follows:
$$\begin{equation} \varphi_M (x,y,t)= \begin{cases} 1, & \textrm{if}\ d_t>{\zeta}_M\\ 0, & \textrm{otherwise} \end{cases}, \end{equation}$$ (1)
$$\begin{equation} \varphi_S (x,y,t)= \begin{cases} 1, & \textrm{if}\ I_t-d_t>{\zeta}_S\\ 0, & \textrm{otherwise,} \end{cases} \end{equation}$$ (2)
where $I_t={\{I_j\}}^N_{j=1}$ is the sequence of $N$ depth frames/images and $d_t={\{d_k\}}^{N-1}_{k=1}$ represents the sequence of differences between each consecutive pair of the $N$ depth images. The thresholds ${\zeta}_M$ and ${\zeta}_S$ identify motion and motionless regions between two successive depth images. Using Equation (1), the MHI $F_M(x, y, t)$ is computed as follows:
$$\begin{equation} F_M(x,y,t)= \begin{cases} N, & \textrm{if}\ \varphi_M(x,y,t)=1\\ F_M(x,y,t-1)-1, & \textrm{otherwise.} \end{cases} \end{equation}$$ (3)
Similarly, Equation (2) provides the SHI $F_S(x, y, t)$ as follows:
$$\begin{equation} F_S(x,y,t)= \begin{cases} N, & \textrm{if}\ \varphi_S(x,y,t)=1\\ F_S(x,y,t-1)-1, & \textrm{otherwise.} \end{cases} \end{equation}$$ (4)
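As a concrete illustration of Equations (1)–(4), the following minimal sketch updates an MHI and an SHI for a single projection view. It assumes the depth frames have already been projected onto that view; the use of an absolute frame difference and the clamping of decayed history values at zero are implementation assumptions not spelled out in the text (the threshold values follow Section 4).

```python
import numpy as np

def motion_static_history(frames, zeta_m=10, zeta_s=50):
    """Compute one MHI and one SHI for a projection view, following Eqs (1)-(4).

    frames : list/array of N projected depth images (2D, equal shapes assumed).
    zeta_m, zeta_s : motion/static thresholds (values taken from Section 4).
    """
    frames = np.asarray(frames, dtype=np.float32)
    n = len(frames)
    mhi = np.zeros(frames[0].shape, dtype=np.float32)
    shi = np.zeros(frames[0].shape, dtype=np.float32)
    for t in range(1, n):
        d_t = np.abs(frames[t] - frames[t - 1])        # frame difference (assumed absolute)
        phi_m = d_t > zeta_m                           # Eq (1): motion regions
        phi_s = (frames[t] - d_t) > zeta_s             # Eq (2): static regions
        # Eq (3): set moving pixels to N, decay the rest by 1 (clamped at 0, an assumption)
        mhi = np.where(phi_m, float(n), np.maximum(mhi - 1.0, 0.0))
        # Eq (4): the same update rule for the static history image
        shi = np.where(phi_s, float(n), np.maximum(shi - 1.0, 0.0))
    return mhi, shi
```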
3.1.2. Feature extraction on motion and static history images

The above section obtains the MHIs and SHIs corresponding to a depth video sequence. This section extracts texture features from the MHIs and SHIs through the LBP [25] operator. The operator was originally introduced in 1994 by Ojala et al. [61, 62]. The descriptor assigns a label to an image pixel by thresholding the pixel's neighborhood against the pixel value. For an intuitive description, let us consider any gray level $g_m$ in an MHI/SHI and let $g_n$, $n\in \{0, 1, 2, \dots, N-1\}$, be one of its $N$ neighbors arranged in a circular pattern. In the pattern, each neighbor lies at radius $R$ from $g_m$. If $(x,y)$ indicates the position of the center value $g_m$, then the coordinates of a neighbor $g_n$ are $(x+R\cos(2\pi n/N),\ y+R\sin(2\pi n/N))$, $n\in \{0, 1, 2, \dots, N-1\}$. The gray values of neighbors that do not fall exactly on a pixel are calculated by bilinear interpolation. Then, the LBP code of $g_m$ is computed for every pixel by the following formula:
$$\begin{equation} {LBP}_{N,R}(x,y)=\sum^{N-1}_{n=0}{G(g_n-g_m)2^n}, \end{equation}$$ (5)
where
$$\begin{equation*} G(h)= \begin{cases} 1, & \textrm{if}\ h\ge 0\\ 0, & \textrm{if}\ h<0. \end{cases} \end{equation*}$$
When Equation (5) is applied to all pixels in an MHI/SHI, the obtained image with the new gray labels is referred to as the LBP coded MHI/SHI. An example LBP coded MHI is shown in Fig. 3. Note that, according to Equation (5), all pixels are initially labeled by binary numbers, which are transformed to decimal form when generating the LBP coded image. The LBP coded MHI/SHI is then divided into several blocks (overlapping or non-overlapping) and block-wise histograms are calculated to describe the texture information of a $P\times Q$ MHI/SHI as
$$\begin{equation} T\left(v\right)=\sum^P_{p=1}{\sum^Q_{q=1}{h\left({LBP}_{N,R}\left(p,q\right),v\right)}}, \end{equation}$$ (6)
where $v\in \left[0,V\right]$ and
$$\begin{equation*} h\left(r,s\right)= \begin{cases} 1, & \textrm{if}\ r=s\\ 0, & \textrm{otherwise.} \end{cases} \end{equation*}$$
Here, $V$ stands for the highest LBP code in the LBP coded MHI/SHI. The original LBP was later modified by adding the concept of the uniform LBP pattern [25]. A binary pattern is called uniform if it contains at most two bitwise $0/1$ (or $1/0$) transitions when traversed circularly. The LBP code '$01110000$' is an example of a uniform LBP pattern with $2$ transitions, whereas the code '$1100100$' is not uniform since it has $4$ transitions. The uniformity of a pattern can be checked by
$$\begin{equation} U\left({LBP}_{N,R}\left(x,y\right)\right)=\left|G\left(g_{N-1}-g_m\right)-G\left(g_0-g_m\right)\right| +\sum^{N-1}_{n=1}{\left|G\left(g_n-g_m\right)-G(g_{n-1}-g_m)\right|}, \end{equation}$$ (7)
where the LBP pattern is uniform when $U\le 2$. The block-wise uniform patterns can be made rotation invariant by
$$\begin{equation} {LBP}^{rotiu}_{N,R}\left(x,y\right)= \begin{cases} \sum^{N-1}_{n=0}{G(g_n-g_m)}, & \textrm{if}\ U\left({LBP}_{N,R}(x,y)\right)\le 2\\ N+1, & \textrm{otherwise,} \end{cases} \end{equation}$$ (8)
where ${LBP}^{rotiu}_{N,R}(x,y)$ denotes the rotation invariant uniform local binary pattern.
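The following sketch implements the basic LBP coding of Equation (5) for one MHI/SHI. For brevity, off-grid neighbor positions are rounded to the nearest pixel instead of being bilinearly interpolated, and border pixels are left uncoded; both are simplifications of the procedure described above.

```python
import numpy as np

def lbp_code_image(img, n_points=4, radius=1):
    """LBP coding of a grayscale MHI/SHI following Eq (5).

    Neighbors are sampled on a circle of the given radius; off-grid positions
    are rounded to the nearest pixel (a simplification of bilinear interpolation).
    """
    img = np.asarray(img, dtype=np.float32)
    h, w = img.shape
    coded = np.zeros((h, w), dtype=np.int32)
    angles = 2.0 * np.pi * np.arange(n_points) / n_points
    offsets = [(radius * np.cos(a), radius * np.sin(a)) for a in angles]
    for y in range(radius, h - radius):
        for x in range(radius, w - radius):
            g_m = img[y, x]
            code = 0
            for n, (dx, dy) in enumerate(offsets):
                g_n = img[int(round(y + dy)), int(round(x + dx))]
                code |= (1 << n) if g_n >= g_m else 0   # G(g_n - g_m) * 2^n
            coded[y, x] = code
    return coded
```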
3.1.3. Motion and static hierarchical feature representation

The LBP coded MHI and SHI images corresponding to the front (XOY), side (YOZ) and top (XOZ) views are divided into $4\times 2$, $4\times 3$ and $3\times 2$ blocks, respectively [29]. The histograms on all the images are calculated on overlapped blocks with $50\%$ overlap between two consecutive blocks. The concatenation of all the block-based histograms yields the MHI/SHI texture representation vector. Thus, the feature vectors computed over the MHI$_{XOY}$ and SHI$_{XOY}$ are referred to as MHI$_{XOY}$-LBP and SHI$_{XOY}$-LBP, respectively. It should be noted that we first compute the front (XOY) projections of all depth frames of an action video; MHI$_{XOY}$ and SHI$_{XOY}$ are then obtained from those projections by employing the 3DMTM algorithm, and finally the MHI$_{XOY}$-LBP and SHI$_{XOY}$-LBP feature vectors are obtained from these MHI$_{XOY}$ and SHI$_{XOY}$ images. Hence, the MHI$_{XOY}$-LBP and SHI$_{XOY}$-LBP feature vectors are obtained in a hierarchical way, and we therefore call them motion and static hierarchical features, respectively. The motion and static hierarchical feature vectors along the side (YOZ) and top (XOZ) projections are computed in the same way. The motion hierarchical features MHI$_{XOY}$-LBP, MHI$_{YOZ}$-LBP and MHI$_{XOZ}$-LBP are concatenated and denoted as MHF, i.e. MHF = [MHI$_{XOY}$-LBP, MHI$_{YOZ}$-LBP, MHI$_{XOZ}$-LBP], and the concatenated version of the static hierarchical features SHI$_{XOY}$-LBP, SHI$_{YOZ}$-LBP and SHI$_{XOZ}$-LBP is denoted as SHF, i.e. SHF = [SHI$_{XOY}$-LBP, SHI$_{YOZ}$-LBP, SHI$_{XOZ}$-LBP]. Figure 4 illustrates the hierarchical feature generation scheme.
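Continuing the sketch, the block-wise histogram descriptor of Equation (6) and the concatenation into MHF/SHF (Fig. 4) can be outlined as follows. The stride derived from the nominal block grid and the 50% overlap is an assumption about details the text leaves open; `lbp_code_image` refers to the sketch given in Section 3.1.2.

```python
import numpy as np

def block_lbp_histogram(coded, blocks=(4, 2), overlap=0.5, n_bins=None):
    """Block-wise histograms of an LBP-coded MHI/SHI (Eq (6)), with 50% overlap
    between consecutive blocks, concatenated into one descriptor vector."""
    if n_bins is None:
        n_bins = int(coded.max()) + 1                 # V + 1 possible LBP codes
    h, w = coded.shape
    bh, bw = h // blocks[0], w // blocks[1]           # nominal block size
    sy = max(1, int(bh * (1 - overlap)))              # vertical stride (assumption)
    sx = max(1, int(bw * (1 - overlap)))              # horizontal stride (assumption)
    feats = []
    for y in range(0, h - bh + 1, sy):
        for x in range(0, w - bw + 1, sx):
            block = coded[y:y + bh, x:x + bw]
            hist, _ = np.histogram(block, bins=n_bins, range=(0, n_bins))
            feats.append(hist)
    return np.concatenate(feats).astype(np.float32)

# MHF/SHF as in Fig. 4: concatenate the per-view descriptors, e.g.
# mhf = np.concatenate([block_lbp_histogram(lbp_code_image(m)) for m in (mhi_xoy, mhi_yoz, mhi_xoz)])
# shf = np.concatenate([block_lbp_histogram(lbp_code_image(s)) for s in (shi_xoy, shi_yoz, shi_xoz)])
```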
Table 1. Recognition accuracy comparison on the MSR-Action3D dataset.

Method | Year | Overall accuracy (%)
Yang et al. [21] (DMM-HOG) | 2012 | 88.7
Yang et al. [41] (Relative Joint Positions) | 2012 | 82.3
Vieira et al. [18] (STOP) | 2012 | 84.8
Wang et al. [19] (Random Occupancy Pattern) | 2012 | 86.5
Xia et al. [23] (DCSF) | 2013 | 89.3
Oreifej et al. [24] (HON4D) | 2013 | 88.9
Zanfir et al. [44] (Moving Pose) | 2013 | 91.7
Tian et al. [11] (SNV) | 2014 | 93.1
Vemulapalli et al. [47] (Skeletons Lie group) | 2014 | 89.5
Chen et al. [26] (DMM-LBP-DF) | 2015 | 93.0
Yang et al. [56] (2D-CNN with DMM Pyramid) | 2015 | 91.1
Yang et al. [56] (3D-CNN with DMM Pyramid) | 2015 | 86.1
Rahmani et al. [31] (HOG3D+LLC) | 2015 | 90.9
Kong et al. [32] (Hierarchical 3D Kernel) | 2015 | 92.7
Liang et al. [76] (Subspace encoding) | 2016 | 94.06
Liu et al. [77] (LSTM+Trust Gates) | 2016 | 94.8
Yang et al. [78] (Extended SNV) | 2017 | 93.5
Liu et al. [79] (Trust Gates) | 2017 | 94.8
Weng et al. [80] (ST-NBNN) | 2017 | 94.8
Maryam et al. [81] (SSTKDes) | 2017 | 95.6
Chen et al. [35] (DMM-GLAC-STACOG) | 2017 | 94.8
Zhang et al. [36] (3DHoT-MBC) | 2017 | 95.2
Keceli et al. [57] (3D-CNN+DHI+relief+SVM) | 2018 | 92.8
Azad et al. [58] (WDMM+HOG) | 2018 | 91.9
Azad et al. [58] (WDMM+LBP) | 2018 | 91.6
Azad et al. [58] (WDMM+CNN) | 2018 | 90.0
Zhang et al. [59] (Deep Activations) | 2018 | 92.3
Zhang et al. [59] (Deep Activations+Attributes) | 2018 | 93.4
Nguyen et al. [82] (Hierarchical Gaussian) | 2018 | 95.6
Bulbul et al. [83] (GMHI+GSHI+CRC) | 2019 | 94.5
Our method (MSVM) | 2019 | 92.31
Our method ($l_2$-CRC) | 2019 | 94.87
Our method (KELM) | 2019 | 95.97

Table 2. Performance of KELM, $l_2$-CRC and MSVM for the MSR-Action3D dataset.

Classifier | Recall | Precision | F1 score | Specificity | Overall accuracy | Kappa coefficient
KELM | 0.9582 | 0.9618 | 0.9587 | 0.9978 | 0.9597 | 0.9576
$l_2$-CRC | 0.9479 | 0.9635 | 0.9448 | 0.9971 | 0.9487 | 0.9460
MSVM | 0.9214 | 0.9348 | 0.9225 | 0.9957 | 0.9231 | 0.9190

Fig. 6. Confusion matrix on MSR-Action3D dataset for the KELM classifier ensemble.
Fig. 7. Confusion matrix on MSR-Action3D dataset for the $l_2$-CRC classifier ensemble.
Fig. 8. Confusion matrix on MSR-Action3D dataset for the MSVM classifier ensemble.
Fig. 9. Class-specific accuracy for KELM, $l_2$-CRC and MSVM on the MSR-Action3D dataset.

Table 3. Recognition accuracy comparison on the UTD-MHAD dataset.

Method | Year | Overall accuracy (%)
Yang et al. [21] (DMM-HOG) | 2012 | 81.5
Chen et al. [15] (Kinect) | 2015 | 66.1
Chen et al. [15] (Inertial) | 2015 | 67.2
Chen et al. [15] (Kinect & Inertial) | 2015 | 79.1
Zhang et al. [36] (3DHoT-MBC) | 2017 | 84.4
Wang et al. [84] (Joint Trajectory + CNN) | 2018 | 85.8
Nguyen et al. [82] (Hierarchical Gaussian) | 2018 | 81.45
Bulbul et al. [83] (GMHI+GSHI+CRC) | 2019 | 89.5
McNally et al. [85] (STAR-Net) | 2019 | 90.0
Our method (MSVM) | 2019 | 83.26
Our method ($l_2$-CRC) | 2019 | 89.07
Our method (KELM) | 2019 | 90.23
Table 4. Performance of KELM, $l_2$-CRC and MSVM for the UTD-MHAD dataset.

Classifier | Recall | Precision | F1 score | Specificity | Overall accuracy | Kappa coefficient
KELM | 0.9028 | 0.9129 | 0.8995 | 0.9961 | 0.9023 | 0.8986
$l_2$-CRC | 0.8912 | 0.9031 | 0.8873 | 0.9956 | 0.8907 | 0.8865
MSVM | 0.8333 | 0.8556 | 0.8333 | 0.9933 | 0.8326 | 0.8261

Table 5. Recognition accuracy comparison on the DHA dataset.

Method | Year | Overall accuracy (%)
Lin et al. [16] (D-STV/AS) | 2012 | 86.8
Gao et al. [48] (D-DMHI-PHOG) | 2015 | 92.4
Gao et al. [48] (DMPP-PHOG) | 2015 | 95.0
Chen et al. [26] (DMM-LBP-DF) | 2015 | 91.3
Chen et al. [33] (Multi-temporal DMM) | 2016 | 95.44
Zhang et al. [36] (3DHoT-MBC) | 2017 | 96.69
Nguyen et al. [82] (Hierarchical Gaussian) | 2018 | 97.96
Our method (MSVM) | 2019 | 96.09
Our method ($l_2$-CRC) | 2019 | 98.26
Our method (KELM) | 2019 | 98.26
3.2. Action classification

The above section computes the MHF and SHF feature vectors associated with the motion and static posture images. In this section, the MHF and SHF feature vectors are first passed to a classifier ensemble constructed from two classifiers. Then, the classification outcomes of the two classifiers are merged through the decision fusion algorithm LOGP. The classifier ensemble is formed by either two KELM, two $l_2$-CRC or two MSVM classifiers. The following sections discuss the classification steps in detail.

3.2.1. Overview of KELM

The KELM was introduced in [63] by adopting a kernel within the Extreme Learning Machine (ELM) classifier [64] to improve the ELM. Let us elaborate on this. Consider an action dataset of $C$ categories, where the category of an action is encoded as $y_{i} \in \{0,1\}$, $i=1,2,\cdots,C$. Assume that the dataset has $m$ training action samples $\{\boldsymbol{x}_{j},\boldsymbol{y}_{j}\}_{j=1}^{m}$, where $\boldsymbol{x}_{j} \in \mathrm{R}^{D}$ and $\boldsymbol{y}_{j} \in \mathrm{R}^{C}$, to train an action model. Based on a feed-forward neural network with $N$ neurons in the hidden layer, the output function is expressed as follows:
$$\begin{equation} h_{N} (\boldsymbol{x}_{j} )=\sum _{k=1}^{N}\boldsymbol{\alpha} _{k} f(\boldsymbol{w}_{k}\cdot\boldsymbol{x}_{j} +e_{k} )=\boldsymbol{y}_{j}, \ j \in \{1,2,\ldots,m\}, \end{equation}$$ (9)
where $f(\cdot)$ represents a nonlinear activation function of the hidden neurons, $\boldsymbol{w}_{k} \in \mathrm{R}^{D}$ and $\boldsymbol{\alpha}_{k} \in \mathrm{R}^{C}$ are the input and output weight vectors associated with the $k$th hidden neuron and $e_{k}$ is the bias of that node. For all values of $j$ in Equation (9), we obtain $m$ equations.
Those equations can be written in a compact form as follows:
$$\begin{equation} \boldsymbol{F\alpha =Y}, \end{equation}$$ (10)
where $\boldsymbol{\alpha } =[\boldsymbol{\alpha } _{1}^{T},\ldots ,\boldsymbol{\alpha } _{N}^{T} ]^{T} \in \mathrm{R}^{N\times C}$, $\boldsymbol{Y}=[\boldsymbol{y}_{1}^{T},\ldots ,\boldsymbol{y}_{m}^{T} ]^{T} \in \mathrm{R}^{m\times C}$ and $\boldsymbol{F}$ is the output matrix of the hidden layer. The precise form of $\boldsymbol{F}$ is
$$\begin{equation} \boldsymbol{F}=\left[\begin{array}{c} {\mathrm{\boldsymbol{f}}(\boldsymbol{x}_{1})} \\{\vdots} \\{\mathrm{\boldsymbol{f}}(\boldsymbol{x}_{m})} \end{array}\right]=\left[\begin{array}{ccc} {f(\boldsymbol{w}_{1}\cdot\boldsymbol{x}_{1} +e_{1} )} & {\cdots} & {f(\boldsymbol{w}_{N}\cdot\boldsymbol{x}_{1} +e_{N} )} \\{\vdots} & {\ddots } & {\vdots } \\{f(\boldsymbol{w}_{1}\cdot\boldsymbol{x}_{m} +e_{1} )} & {\cdots } & {f(\boldsymbol{w}_{N}\cdot\boldsymbol{x}_{m} +e_{N} )} \end{array}\right]. \end{equation}$$ (11)
Since typically $N\ll m$ [64], the solution of Equation (10) is
$$\begin{equation} \boldsymbol{\alpha = F}^{\dagger}\boldsymbol{Y}, \end{equation}$$ (12)
where $\boldsymbol{F}^{\dagger}$ represents the Moore–Penrose generalized inverse of the matrix $\boldsymbol{F}$, given by $\boldsymbol{F}^{\dagger} =\boldsymbol{F}^{T} (\boldsymbol{FF}^{T})^{-1}$. Furthermore, a regularization term $\frac{\boldsymbol{I}}{\rho}$ ($\rho>0$) is added to $\boldsymbol{FF}^{T}$ to enhance the stability of the solution, so that
$$\begin{equation} h_{N} (\boldsymbol{x}_{j} )=\boldsymbol{\mathrm{f}}(\boldsymbol{x}_{j} )\boldsymbol{\alpha}=\boldsymbol{\mathrm{f}}(\boldsymbol{x}_{j} )\boldsymbol{F}^{T} \left(\frac{\boldsymbol{I}}{{\rho}} +\boldsymbol{FF}^{T} \right)^{-1} \boldsymbol{Y}. \end{equation}$$ (13)
When the feature mapping is unknown in the ELM, a kernel matrix is employed as
$$\begin{equation} \Omega _{ELM} =\boldsymbol{FF}^{T}:\ \Omega _{ELM_{j,s}} =\boldsymbol{\mathrm{f}}(\boldsymbol{x}_{j} )\cdot\boldsymbol{\mathrm{f}}(\boldsymbol{x}_{s} )=K(\boldsymbol{x}_{j},\boldsymbol{x}_{s}). \end{equation}$$ (14)
Hence, the KELM provides the output function
$$\begin{equation} h_{N} (\boldsymbol{x}_{j} )=\left[\begin{array}{c} {K(\boldsymbol{x}_{j},\boldsymbol{x}_{1} )} \\{\vdots} \\{K(\boldsymbol{x}_{j},\boldsymbol{x}_{m} )} \end{array}\right]\left(\frac{\boldsymbol{I}}{\rho} +\Omega _{ELM} \right)^{-1} \boldsymbol{Y}. \end{equation}$$ (15)
The category of a query action vector $\boldsymbol{x}_{t}$ is determined by
$$\begin{equation} y_{t} ={\mathop{\arg\max}\limits_{i=1,2,\ldots,C}} \ h_{N} (\boldsymbol{x}_{t} )_{i}, \end{equation}$$ (16)
where $h_{N} (\boldsymbol{x}_{t} )_{i}$ indicates the $i$th entry of $h_{N} (\boldsymbol{x}_{t} )=[h_{N} (\boldsymbol{x}_{t} )_{1},\ldots, h_{N} (\boldsymbol{x}_{t} )_{C} ]^{T}$. We use the radial basis function (RBF) kernel in the KELM.
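A compact NumPy sketch of the closed-form KELM of Equations (14)–(16) with an RBF kernel is given below. The one-hot target encoding, the default parameter values and the identification of the penalty parameter C of Section 4 with the role of ρ in Equation (15) are assumptions made for illustration.

```python
import numpy as np

def rbf_kernel(a, b, gamma):
    """RBF kernel matrix K(a_i, b_j) = exp(-gamma * ||a_i - b_j||^2)."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

class KELM:
    """Kernel ELM following Eqs (14)-(16): closed-form output weights."""
    def __init__(self, C=1.0, gamma=1e-3):   # placeholder values; tuned in Section 4
        self.C, self.gamma = C, gamma

    def fit(self, X, y):
        self.X = np.asarray(X, dtype=np.float64)
        y = np.asarray(y)
        self.classes = np.unique(y)
        Y = (y[:, None] == self.classes[None, :]).astype(np.float64)  # one-hot targets
        omega = rbf_kernel(self.X, self.X, self.gamma)                # Omega_ELM, Eq (14)
        m = omega.shape[0]
        # (I/rho + Omega)^(-1) Y, with rho written as C here (assumption)
        self.beta = np.linalg.solve(np.eye(m) / self.C + omega, Y)
        return self

    def decision_scores(self, Xq):
        """Per-class outputs h_N(x)_i of Eq (15) for query rows Xq."""
        return rbf_kernel(np.asarray(Xq, dtype=np.float64), self.X, self.gamma) @ self.beta

    def predict(self, Xq):
        return self.classes[np.argmax(self.decision_scores(Xq), axis=1)]   # Eq (16)
```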
Fig. 10. Confusion matrix on UTD-MHAD dataset using the KELM classifier ensemble.
Fig. 11. Confusion matrix on UTD-MHAD dataset using the $l_2$-CRC classifier ensemble.
Fig. 12. Confusion matrix on UTD-MHAD dataset using the MSVM classifier ensemble.
Fig. 13. Class-specific accuracy for KELM, $l_2$-CRC and MSVM on the UTD-MHAD dataset.

3.2.2. LOGP on KELM ensemble

In our problem, we employ two KELM algorithms to construct the KELM ensemble. The classification outcomes of those KELM classifiers are then merged by employing the LOGP [65] to compute the label of a query sample. Each KELM yields an action class through the function $h_{N}(\boldsymbol{x}_{t})_{i}$, and this function is therefore used in the LOGP to obtain the probability of a predicted class; a higher value of $h_{N}(\boldsymbol{x}_{t})_{i}$ corresponds to a higher probability of the predicted label. For all class labels, the posterior probabilities are determined from $h_{N}(\boldsymbol{x}_{t})_{i}$ as
$$\begin{equation} p(y_{i}|\boldsymbol{x}_t)=\frac{1}{1+\exp (Ah_{N} (\boldsymbol{x}_t)_{i} +B)}. \end{equation}$$ (17)
Setting $A=-1$ and $B=0$ in Equation (17), to simplify the probability calculation, we obtain
$$\begin{equation} p(y_{i} |\boldsymbol{x}_t)=\frac{1}{1+\exp (-h_{N} (\boldsymbol{x}_t)_{i})}. \end{equation}$$ (18)
The probabilities obtained from the above equation are used to estimate a global membership function
$$\begin{equation} P\left(y_i|\boldsymbol{x}_t\right)=\prod^L_{q=1}{p_q(y_i|\boldsymbol{x}_t)^{\beta_q}}, \end{equation}$$ (19)
where $L$ indicates the total number of classifiers in the ensemble (here, $L=2$) and $\beta_{q}$ represents the weight of the $q$th classifier. To ease the calculation of $P(y_{i}|\boldsymbol{x}_t)$, we set $\beta_{q} =1/L$, $\forall q\in \{1,2,\ldots,L\}$ in Equation (19). The final action label is obtained by
$$\begin{equation} y_t ={\mathop{\arg\max}\limits_{i=1,2,\ldots,C}}\ P(y_{i} |\boldsymbol{x}_t). \end{equation}$$ (20)
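The LOGP fusion of the two KELM outputs (Equations (18)–(20)) reduces to a few lines. The sketch below assumes the per-class scores $h_N(\boldsymbol{x}_t)_i$ of the MHF- and SHF-based KELMs are available as vectors (e.g. from decision_scores in the previous sketch).

```python
import numpy as np

def logp_fuse_kelm(scores_mhf, scores_shf):
    """LOGP fusion of two KELM classifiers' per-class outputs (Eqs (18)-(20)).

    scores_* : arrays of shape (n_classes,) holding h_N(x_t)_i for the MHF-
    and SHF-based KELM, respectively. Returns the fused class index.
    """
    p_m = 1.0 / (1.0 + np.exp(-np.asarray(scores_mhf)))   # Eq (18), with A = -1, B = 0
    p_s = 1.0 / (1.0 + np.exp(-np.asarray(scores_shf)))
    fused = (p_m ** 0.5) * (p_s ** 0.5)                    # Eq (19), beta_q = 1/L, L = 2
    return int(np.argmax(fused))                           # Eq (20)
```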
3.2.3. Overview of $l_2$-CRC

The above section uses the supervised learning algorithm KELM to recognize human actions. To observe the classification performance of our proposed method with a classifier that requires no explicit training phase, we adopt the $l_2$-CRC [6]. For an intuitive description of the $l_2$-CRC, let us consider an action dataset with $C$ classes. By arranging the training samples column-wise, we obtain a dictionary $\boldsymbol{P}=[\boldsymbol{P}_{1},\boldsymbol{P}_{2},\dots,\boldsymbol{P}_{C}]=[\boldsymbol{p}_{1},\boldsymbol{p}_{2},\dots,\boldsymbol{p}_{N}]\in R^{D\times N}$, where $D$ denotes the dimension of the samples and $N$ is the total number of training samples. Here, $\boldsymbol{P}_{j}\in R^{D\times M_j}$ $(j=1,2,\dots,C)$ is the subset of training samples of the $j$th class and $\boldsymbol{p}_i\in R^D$ $(i=1,2,\dots,N)$ is a single training sample. A new sample $\boldsymbol{S}\in R^D$ can be described using the matrix $\boldsymbol{P}$ as follows:
$$\begin{equation} \boldsymbol{S}=\boldsymbol{P}\boldsymbol{\beta}, \end{equation}$$ (21)
where $\boldsymbol{\beta}$ is an $N\times 1$ vector of coefficients corresponding to the training samples. Solving Equation (21) is not trivial, as it is typically underdetermined [66]. Generally, the solution is obtained by solving the following norm minimization problem:
$$\begin{equation} \hat{\boldsymbol{\beta}}=\mathop{\arg\min}\limits_{\boldsymbol{\beta}} \left\{{\left\|\boldsymbol{S}-\boldsymbol{P}\boldsymbol{\beta }\right\|}^2_2+{\mu \left\|\boldsymbol{A}\boldsymbol{\beta }\right\|}^2_2\right\}, \end{equation}$$ (22)
where $\boldsymbol{A}$ denotes the Tikhonov regularization matrix [67] and $\mu$ denotes the well-known regularization parameter. The term involving $\boldsymbol{A}$ imposes prior information on the solution using the methodology described in [68–70]: training samples that are very different from the test sample are assigned less weight than training samples that are highly similar to it. The diagonal matrix $\boldsymbol{A}\in R^{N\times N}$ is constructed as follows:
$$\begin{equation} \boldsymbol{A} = \left[ \begin{array}{c c c c} \|\boldsymbol{S} - \boldsymbol{p}_1\|_2 & 0 & \cdots & 0 \\ 0 & \|\boldsymbol{S} - \boldsymbol{p}_2\|_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \|\boldsymbol{S} - \boldsymbol{p}_{N}\|_2 \end{array}\right]. \end{equation}$$ (23)
According to [71], the coefficient vector $\hat{\boldsymbol{\beta}}$ is evaluated as
$$\begin{equation} \hat{\boldsymbol{\beta}}={({\boldsymbol{P}}^T\boldsymbol{P}+\mu{\boldsymbol{A}}^T\boldsymbol{A})}^{-1}{\boldsymbol{P}}^T\boldsymbol{S}. \end{equation}$$ (24)
After that, by using the class labels of all the training samples, $\hat{\boldsymbol{\beta}}$ can be subdivided into $C$ subsets as
$$\begin{equation*} \hat{\boldsymbol{\beta}}=\left[\hat{{\boldsymbol{\beta}}_1};\hat{{\boldsymbol{\beta}}_2};\hat{{\boldsymbol{\beta}}_3};\dots;\hat{{\boldsymbol{\beta}}_C}\right] \end{equation*}$$
with $\hat{{\boldsymbol{\beta}}_j}\ (j=1,2,\dots,C)$.

Table 6. Performance of KELM, $l_2$-CRC and MSVM for the DHA dataset.

Classifier | Recall | Precision | F1 score | Specificity | Overall accuracy | Kappa coefficient
KELM | 0.9826 | 0.9848 | 0.9825 | 0.9992 | 0.9826 | 0.9818
$l_2$-CRC | 0.9826 | 0.9848 | 0.9825 | 0.9992 | 0.9826 | 0.9818
MSVM | 0.9609 | 0.9663 | 0.9604 | 0.9981 | 0.9609 | 0.9591

Table 7. Running time (mean $\pm$ std) of the major components of the algorithm.

Major component | Running time (ms)
3DMTM-based MHI/SHI generation | 606.2 $\pm$ 40.2 per action sample (40 frames)
MHF feature extraction | 50.4 $\pm$ 2.9 per action sample (40 frames)
SHF feature extraction | 50.8 $\pm$ 2.7 per action sample (40 frames)
PCA-based dimensionality reduction | 0.03 $\pm$ 0.03 per action sample (40 frames)
KELM ensemble | 1.1 $\pm$ 0.8 per action sample (40 frames)
Total running time | 708.53 $\pm$ 46.63 per 40 frames

After partitioning $\hat{\boldsymbol{\beta}}$, the class label of the new sample $\boldsymbol{S}$ is evaluated as follows:
$$\begin{equation} Class(\boldsymbol{S})=\mathop{\arg\min}\limits_{j\in \{1, 2, \dots, C\}} \left\{{\left\|\boldsymbol{S}-{\boldsymbol{P}}_j\hat{{\boldsymbol{\beta}}_j}\right\|}_2\right\}. \end{equation}$$ (25)
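A direct NumPy transcription of Equations (22)–(25) is sketched below. The value of mu is only a placeholder within the range tuned in Section 4, and returning the class-wise residual vector anticipates the decision fusion of Section 3.2.4.

```python
import numpy as np

def l2_crc_classify(S, P, labels, mu=0.001):
    """l2-regularized CRC with Tikhonov weighting (Eqs (22)-(25)).

    S: query feature vector (D,); P: training matrix (D, N); labels: (N,)
    class label of each training column; mu: regularization parameter
    (placeholder value within the range tuned in Section 4).
    """
    S = np.asarray(S, dtype=np.float64)
    P = np.asarray(P, dtype=np.float64)
    labels = np.asarray(labels)
    # Diagonal Tikhonov matrix: distance of the query to each training sample (Eq (23))
    A = np.diag(np.linalg.norm(P - S[:, None], axis=0))
    # Closed-form coefficient vector (Eq (24))
    beta = np.linalg.solve(P.T @ P + mu * (A.T @ A), P.T @ S)
    # Class-wise reconstruction residuals (Eq (25))
    residuals = {}
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        residuals[c] = np.linalg.norm(S - P[:, idx] @ beta[idx])
    pred = min(residuals, key=residuals.get)   # class with smallest residual
    return pred, residuals
```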
3.2.4. LOGP on $l_2$-CRC ensemble

For the $l_2$-CRC [72] classifier ensemble, the same fusion strategy is used as for the KELM, but the weighting scheme is slightly different from the one described previously; therefore, the weighted fusion of the $l_2$-CRC classifications is presented here. For testing a sample $\boldsymbol{S}$, let the feature vectors MHF and SHF be denoted by $F_K$ and $F_L$, respectively. These vectors are used individually as inputs of two CRC classifiers. As a result, two error vectors ${\boldsymbol{e}}^{K}=[e^K_1, e^K_2, e^K_3, \dots, e^K_C]$ and ${\boldsymbol{e}}^{L}=[e^L_1, e^L_2, e^L_3, \dots, e^L_C]$ are created, where ${\boldsymbol{e}}^K$ is the error vector of the CRC classifier using $F_K$ and ${\boldsymbol{e}}^{L}$ that of the CRC classifier using $F_L$. To merge the outcomes of the two classifiers, the LOGP [73] technique is employed.
In LOGP, the individual posteriors |$p_q(\omega_j\,|\,\boldsymbol{S})$| are combined into the global membership function
$$\begin{equation} P\left(\omega_j\,|\,\boldsymbol{S}\right)=\prod^Q_{q=1}{p_q(\omega_j\,|\,\boldsymbol{S})}^{a_q}, \end{equation}$$ (26)
where |$\omega_j$| represents the label of the |$j$|th action class with |$j=1,\dots,C$|⁠, |$Q$| is the number of classifiers (⁠|$Q=2$| in our system) and the weights |$a_q$| are uniformly distributed (that is, |$a_q=\frac{1}{Q}$| for |$q=1,2,\dots,Q$|⁠). Given the residual output |$\boldsymbol{e}=[e_1,\ e_2,\dots,e_C]$| of a classifier, a Gaussian mass function
$$\begin{equation} p_q\left(\omega_j\,|\,\boldsymbol{S}\right)\approx \exp(-e_j), \end{equation}$$ (27)
is adopted, so that a smaller residual error |$e_j\ (j=1,\dots,C)$| yields a higher probability |$p_q(\omega_j\,|\,\boldsymbol{S})$|⁠. Therefore, the executed decision-level fusion is
$$\begin{equation} P\left(\omega_j\,|\,\boldsymbol{S}\right)={\exp{(-e^K_j)}^{\frac{1}{2}}}\times{\exp{(-e^L_j)}^{\frac{1}{2}}}, \end{equation}$$ (28)
where |${\boldsymbol{e}}^K$| and |${\boldsymbol{e}}^L$| are normalized to |$\left[0,\ 1\right]$|. The final class label for |$\boldsymbol{S}$| is then assigned to the class with the largest probability. Mathematically,
$$\begin{equation} Class(\boldsymbol{S})=\mathop{\arg\max}_{j\in \{1,2,\dots,C\}}\{P\left(\omega_j\,|\,\boldsymbol{S}\right)\}. \end{equation}$$ (29)

3.2.5. MSVM classifier ensemble and LOGP on the ensemble

In this case, the same weighted fusion scheme described above is used to merge the outputs of the MSVM [74] classifiers. For the MSVM, the weighted fusion is simpler than for the other two classifiers. The KELM and |$l_2$|-CRC classifiers do not produce probabilistic outputs directly, and therefore their raw outputs are converted into probabilities using a sigmoid function or a Gaussian mass function, respectively. The MSVM classifiers, in contrast, directly output the required probabilities. Thus, in our case, the two MSVM classifiers output the probabilities corresponding to the two feature sets MHF and SHF, and these probabilities are combined by employing Equation (19) as in KELM. After applying the LOGP, the final class label is obtained in the same way as described by Equation (20). Note that the MSVM also uses the RBF kernel, as in KELM, to build the classification model.

4. Experimental Results and Discussion

We have evaluated our proposed method on the MSR-Action3D [14], UTD-MHAD [15] and DHA [16] datasets, and we compare our classification results with those of existing approaches. Note that the comparison with other methods is based on overall recognition accuracy only, since that is the only metric those methods report. In addition to overall accuracy, our proposed method is evaluated with several further metrics for a thorough assessment (see Section 4.1); these metrics also allow the performances of the different classifier ensembles to be examined. For all the action datasets, careful tuning of the 3DMTM thresholds, the LBP parameters and the classifier parameters is crucial to obtaining promising performance. Hence, our method selects the optimal values of those parameters as described below.
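Before detailing those settings, the decision-level fusion of Equations (26)–(29) can be summarized in a short sketch (again illustrative Python/NumPy rather than the authors' code; the min-max normalization to [0, 1] is an assumption, since the normalization scheme is not specified).

import numpy as np

def logp_fuse(residual_vectors):
    """LOGP fusion of Q classifiers' class-wise residuals (Equations (26)-(29)).
    residual_vectors: list of Q length-C arrays, e.g. [e_K, e_L].
    Returns the 0-based index of the predicted class."""
    Q = len(residual_vectors)
    log_prob = np.zeros_like(residual_vectors[0], dtype=float)
    for e in residual_vectors:
        # Normalize the residuals to [0, 1] (assumed min-max scaling).
        e = (e - e.min()) / (e.max() - e.min() + 1e-12)
        # Gaussian mass p_q(w_j|S) ~ exp(-e_j) (Equation (27)), raised to the
        # uniform weight a_q = 1/Q and accumulated in the log domain.
        log_prob += -e / Q
    # Largest fused probability = largest sum of logs (Equation (29)).
    return int(np.argmax(log_prob))

For the MSVM ensemble, the same uniformly weighted product would instead be applied directly to the class probabilities returned by the two classifiers. With the classification pipeline summarized in this way, we now turn to the parameter settings used in the experiments.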
More specifically, in the 3DMTM, along all views, the thresholds |${\zeta}_S$| and |${\zeta}_M$| corresponding to the static posture update function |${\varphi}_S(x,\ y,\ t)$| and the motion update function |${\varphi}_M(x,\ y,\ t)$| are set to |${\zeta}_S=50$| and |${\zeta}_M=10$|, respectively. These thresholds are selected after empirical evaluation; there is usually no theoretical guideline for their selection. In our empirical analysis, we always keep |${\zeta}_M$| lower than |${\zeta}_S$| so that motion postures are not missed. The sizes of the MHIs/SHIs are set empirically to |$240\times 320$|⁠, |$240\times 256$| and |$256\times 320$| for the front (XOY), side (YOZ) and top (XOZ) views, respectively. In LBP, for all views, the MHIs and SHIs are split into |$4\times 2$| blocks with 50% overlap between two consecutive blocks [29]. The radius |$R$| is set to 1 and the number of sampling points |$N$| to |$4$|, without further tuning; we deliberately avoid tuning these values to keep the algorithm free of excessive parameter tuning and to propose a simple system with a small number of parameters. Further tuning could be investigated for additional accuracy gains. The RBF kernel (defined in Section 3.2.1) is used in the KELM and MSVM algorithms. Thus, two main parameters need to be optimized: the penalty parameter |$C$| and the kernel bandwidth parameter |$\gamma$|⁠. The parameter |$C$| regulates the trade-off between model complexity and the maximization of fitting accuracy, while the bandwidth parameter |$\gamma$| determines the nonlinear transformation of the input feature space into a high-dimensional feature space. The parameter optimization for KELM, MSVM and |$l_2$|-CRC is accomplished through 5-fold cross-validation on the training samples of each action dataset. In KELM and MSVM, the parameters |$C$| and |$\gamma$| are tuned in the ranges |$10\sim{10}^8$| and |${10}^{-8}\sim 300$|, respectively. The |$l_2$|-CRC parameter |$\mu$| is determined in the range |$0.00001\sim 10$|⁠. In all datasets, the MHF and SHF feature vectors have the same dimension (since the MHIs/SHIs and the LBP blocks have the same sizes), namely 945. To reduce the computational load of the algorithm, principal component analysis (PCA) is employed to reduce the dimension of the resulting action vector; in all experiments, the principal components that account for 99% of the entire variation are retained. These settings are used for all the action datasets in all experimental evaluations of our proposed system.

4.1. Evaluation measures

In the experiments, we consider the recall (sensitivity), precision, F1 score, specificity, overall accuracy and kappa coefficient to assess classifier performance.

4.2. Evaluation on MSR-Action3D dataset

The MSR-Action3D dataset [14] contains 20 actions, each performed two or three times by 10 different subjects facing the RGB-D camera. The action categories are drawn from a sports context and cover a diversity of motions related to arms, legs, torso, etc. The dataset exhibits inter-class similarity; for example, draw x and draw tick are similar except for a slight difference in the movement of one hand.
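Before reporting the per-dataset results, the dimensionality-reduction and parameter-search steps described above can be illustrated with a brief sketch. This is not the authors' code: scikit-learn's RBF-kernel SVC is used here merely as a stand-in classifier (the paper tunes KELM, MSVM and |$l_2$|-CRC analogously), and the logarithmic grid spacing is an assumption, since only the search ranges are reported.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def reduce_and_tune(train_feats, train_labels):
    # Retain the principal components accounting for 99% of the total variation.
    pca = PCA(n_components=0.99)
    reduced = pca.fit_transform(train_feats)
    # 5-fold cross-validated grid search over C in [10, 1e8] and
    # gamma in [1e-8, 300] (grid spacing assumed logarithmic).
    param_grid = {"C": np.logspace(1, 8, 8),
                  "gamma": np.logspace(-8, np.log10(300), 10)}
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
    search.fit(reduced, train_labels)
    return pca, search.best_params_

The following subsections report the results obtained with the tuned settings on each dataset.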
We employ all 20 actions to evaluate our method, and the validation protocol follows the set-up described by Li et al. [14]: the samples of subjects 1, 3, 5, 7 and 9 are used for training, whereas the samples of subjects 2, 4, 6, 8 and 10 are used for testing/validation. This validation scheme is one of the most commonly adopted owing to its computational efficiency [75]. The RBF kernel-based KELM model is trained with |$C=100$| and |$\gamma=0.06$| as optimal parameter values, the RBF kernel-based MSVM parameter pair (⁠|$C,\gamma$|⁠) is set to (⁠|${10}^3,\ {10}^{-8}$|⁠) and the |$l_2$|-CRC parameter |$\mu$| is set to |$1.3$|⁠. Our proposed algorithm achieves a recognition accuracy of 95.97% (see Table 1) when the KELM classifier ensemble is employed. Table 1 shows that the proposed system surpasses other existing systems considerably. The proposed approach achieves higher recognition accuracy than the methods that use motion information only [21, 26, 35, 36, 56]. Specifically, our approach achieves recognition accuracy that is 7.27% higher than [21], 2.97% higher than [26], 1.1% higher than [35], 0.77% higher than [36] and 9.87% higher than [56]. Although our method uses handcrafted features, it outperforms the motion image-based deep learning methods, namely [56] (3D-CNN with DMM Pyramid) by 9.87%, [56] (2D-CNN with DMM Pyramid) by 4.87%, [57] (3D-CNN+DHI+relief+SVM) by 3.27%, [58] (WDMM+CNN) by 5.97%, [59] (Deep Activations) by 3.67% and [59] (Deep Activations + Attributes) by 2.57% (see Table 1). These improvements indicate that applying the feature descriptor to both motion images and static images gathers more informative features. Furthermore, although the method in [26] utilizes similar tools, our method outperforms it by 2.9% in recognition accuracy; the tools are similar, but the feature representation scheme is different. In [26], the features extracted from motion images along the three projection views are represented individually through three vectors. In contrast, we incorporate the three feature vectors into the single MHF vector to represent the motion information, and similarly combine the feature vectors from the motionless images into the SHF vector as additional information. Thus, unlike the method in [26], we change the feature representation technique and include the static information to obtain this improved outcome. Several challenges remain in the recognition process; two major issues are intra-class variability (differences between samples of the same action class) and inter-class similarity (similarity between samples of different classes). The actions draw tick, draw x and draw circle share several similarities, which can be seen by inspecting their frames in Fig. 6. At times, frames of draw tick are confused with those of draw x, which degrades the classification and yields a misclassification rate of 5.9%; the confusion rate of draw tick with draw circle is likewise 5.9%. In a similar manner, due to inter-class similarity, seven actions cannot be classified exactly. Figure 6 presents the confusion matrix, where all the misclassification rates are shown.
We also apply the other two classifier ensembles, |$l_2$|-CRC and MSVM, to this dataset; they achieve 94.87% and 92.31% accuracy, respectively (see Table 2). To assess each classifier ensemble, we also compute the statistical measures reported in Table 2. The results in Table 2 show that KELM performs better than the |$l_2$|-CRC and MSVM ensembles. A class-specific accuracy comparison is shown in Fig. 9, where KELM is superior to the other two ensembles for every action. The confusion matrices for the other two classifier ensembles are shown in Figs 7 and 8, respectively.

4.3. Evaluation on UTD-MHAD dataset

The UTD-MHAD [15] dataset contains 27 different actions performed by 8 individuals (4 females and 4 males) in front of a Microsoft Kinect camera, with each individual performing each action four times. The dataset consists of 861 depth action video clips after removing three corrupted clips. It is very challenging, as it covers diverse actions such as sport actions (e.g. bowling), hand gestures (e.g. draw x), daily activities (e.g. knock on door) and training exercises (e.g. arm curl). All 27 actions are employed; half of the subjects (i.e. 1, 3, 5 and 7) are used for training and the remaining subjects (i.e. 2, 4, 6 and 8) for validation/testing, as described in [15]. The RBF kernel-based KELM and MSVM classifiers are trained with the parameter pairs (⁠|$C,\gamma$|⁠) set to |$({10}^4,0.03)$| and (⁠|${10}^6,\ {10}^{-8}$|⁠), respectively, and the |$l_2$|-CRC parameter |$\mu$| is optimized to |$0.01$| for promising classification. Experimental evaluations on the UTD-MHAD dataset using the KELM, |$l_2$|-CRC and MSVM classifier ensembles show that KELM achieves better accuracy than the other two classifiers (see Tables 3 and 4). Table 3 lists the recognition accuracy of our approach and of other existing approaches; all of our classifier ensembles exhibit better accuracy than the other methods listed in Table 3. More precisely, our method outperforms [21] by 8.73%, [15] (Kinect) by 24.13%, [15] (Kinect & Inertial) by 11.13% and [36] by 5.83%. Those methods utilize powerful and complex feature extraction and classification strategies; in contrast, our method, though simple, outperforms them considerably. This outstanding performance again suggests that static information, used alongside motion information, can enhance the recognition system significantly. Table 4 contains the statistical evaluations of the classifiers employed in our method. For all the classifiers, the confusion matrices and the class-specific accuracy comparison are shown in Figs 10–13.

4.4. Evaluation on DHA dataset

The DHA dataset was introduced by Lin et al. [16]. Its action types are extended from the Weizmann dataset [86], which is widely used in action recognition from RGB sequences. The DHA dataset consists of 23 action categories: the 1st–10th categories follow the same definitions as in the Weizmann action dataset [87], the 11th–16th are extended categories and the remaining actions (17th–23rd) are selected sport actions. The dataset contains 483 depth action sequences in total, where each action is performed by 21 subjects (12 males and 9 females).
In this dataset, inter-similarity among different types of action classes is also found. For example, the golf-swing and rod-swing actions contain similar motion segments, with the hands moving from one side up to the other side; further analogous action pairs include leg-curl and leg-kick, run and walk, etc. The action samples of subjects 1, 3, 5, 7, 9, 11, 13, 15, 17, 19 and 21 are used for training, and the samples of subjects 2, 4, 6, 8, 10, 12, 14, 16, 18 and 20 are used for validation/testing [16]. The RBF kernel-based KELM classifier is trained with |$C={10}^4$| and |$\gamma=0.05$| as optimal values, the RBF kernel-based MSVM parameter pair (⁠|$C,\gamma$|⁠) is optimized to (⁠|${10}^4,\ {10}^{-8}$|⁠) and the |$l_2$|-CRC parameter |$\mu$| is adjusted to |$0.1$| for promising classification outcomes.

Fig. 14. Confusion matrix on the DHA dataset using the KELM classifier ensemble.
Fig. 15. Confusion matrix on the DHA dataset using the |$l_2$|-CRC classifier ensemble.
Fig. 16. Confusion matrix on the DHA dataset using the MSVM classifier ensemble.
Fig. 17. Class-specific accuracy for KELM, |$l_2$|-CRC and MSVM on the DHA dataset.

On the DHA dataset, all three classifier ensembles achieve considerable performance (see Tables 5 and 6). The KELM and |$l_2$|-CRC ensembles exhibit identical classification performance with an overall accuracy of 98.26%; for these two ensembles only three of the 23 action categories are misclassified (see Figs 14 and 15), whereas the MSVM ensemble produces more misclassifications (see Fig. 16). Our approach outperforms [16] by 11.46%, [26] by 6.96%, [33] by 2.82%, [36] by 1.57%, [48] (D-DMHI-PHOG) by 5.86% and [48] (DMPP-PHOG) by 3.26%. Those methods apply feature descriptors to motion images only, whereas our method applies the descriptor to both motion and static images; the features from static images are thus clearly discriminative when used in addition to the features obtained from motion images. Note that, although the method in [26] utilizes tools similar to ours, our better feature representation yields a 6.96% higher recognition rate. The class-specific accuracy of the three classifier ensembles is shown in Fig. 17.

5. Computational Efficiency

The computational efficiency of our method is assessed through the running time of the major components of the algorithm and through their computational complexity. Since running time varies from machine to machine, computational complexity is also considered to gauge the efficiency of the algorithm.

5.1. Running time

The proposed system is evaluated on a CPU platform with an Intel i5-7500 quad-core CPU @ 3.41 GHz and 16 GB of RAM.
The processing time of the proposed approach depends on five major components: 3DMTM-based MHI/SHI generation, MHF feature extraction, SHF feature extraction, PCA-based dimensionality reduction and KELM classification. Although we employ ensembles with three different classifiers, only the running time of the KELM ensemble is reported here, as it exhibits the most promising performance. The average running time (in milliseconds) of the five components per action sample is given in Table 7. The timings are measured on the MSR-Action3D [14] dataset, where each action sample contains 40 depth frames on average. In Table 7, the running time for MHF extraction is close to the SHF computation time, since the inputs (MHIs/SHIs) in both cases have the same sizes. The total running time for 40 frames is less than 1 second, i.e. 708.53 |$\pm$| 46.63 milliseconds. Consequently, our recognition system is capable of real-time operation, processing more than 40 depth video frames per second.

5.2. Computational complexity

The computational complexity of our method comprises the complexities of PCA and of the KELM ensemble. PCA has a complexity of |$O\left(m^3+m^2r\right)$| [22] and the KELM ensemble has a complexity of |$2\times O\left(r^3\right)$| [88]. Thus, the total computational complexity of the proposed system is |$O\left(m^3+m^2r\right)+2\times O\left(r^3\right)$|⁠. The computational complexity is reported and compared with those of other methods in Table 8. Notably, the method described in [26] employs a feature extraction algorithm and classifier ensemble similar to ours, yet has a higher computational complexity (see Table 8), since it uses three KELM classifiers to form its ensemble whereas we use two. Moreover, our approach outperforms that system by about 3% on the MSR-Action3D dataset and 7% on the DHA dataset in terms of recognition accuracy (see Tables 1 and 5). Overall, our system achieves superiority over the method in [26] with lower computational complexity and higher recognition accuracy. In Table 8, the computational complexities of the methods in [14] and [21] are lower than ours, but their recognition accuracies are much inferior to our approach (see Tables 1, 3 and 5). Our approach also performs strongly against all other methods listed in Table 8. It is worth mentioning that our recognition results are not compared with those of [39], since that approach was evaluated with a different experimental set-up on the MSR-Action3D dataset. Thus, the proposed method shows superiority with regard to both recognition rate and computational efficiency.

6. Conclusions

This paper has introduced a framework to recognize human actions from the depth data provided by a depth sensor. Each depth action sequence is represented individually through motion and static hierarchical feature vectors. These features are captured by the LBP operator from motion and static posture images, which are the outputs of the 3DMTM for an action sequence. Evaluation of the framework on three benchmark datasets indicates state-of-the-art performance over existing methods. Owing to our feature representation strategy, the proposed method outperforms methods that use similar tools by 2.9% on the MSR-Action3D dataset and 6.96% on the DHA dataset.
The recognition rate of our system is also compared with that of systems which use only motion posture image-based features to represent actions; this comparison indicates that using both motion and static posture image-based features, as our method does, improves recognition accuracy significantly. Our method also surpasses the motion image-based deep learning methods, outperforming '3D-CNN with DMM Pyramid' by 9.87%, '2D-CNN with DMM Pyramid' by 4.87%, '3D-CNN+DHI+relief+SVM' by 3.27%, 'WDMM+CNN' by 5.97%, 'Deep Activations' by 3.67% and 'Deep Activations + Attributes' by 2.57%. We use three different classifier ensembles to classify actions from the obtained feature vectors. Among them, the KELM ensemble outperforms the other two according to the statistical measurements, while the MSVM ensemble shows inferior recognition results compared to the other two on all datasets. The confusion matrices indicate that confusion between similar action patterns (such as draw tick and draw x in Fig. 6) cannot be completely eliminated. When some frames of two separate depth action clips are identical, the corresponding MHIs/SHIs of the two actions preserve similar movements of the actors, so the action descriptions based on the MHI/SHI of the two different actions become confusable; for instance, the MHI/SHI representations of the draw x and draw tick clips are similar, which causes the confusion. In future work, we will change the MHI and SHI construction strategy and attempt to represent an action video through multiple MHIs and SHIs instead of single ones, with the aim of obtaining enough MHIs and SHIs to build a 2D-CNN deep model for recognizing human actions more robustly.

ACKNOWLEDGMENTS

This work is jointly supported by Jashore University of Science and Technology, and Bangladesh University Grants Commission (UGC), Bangladesh.

References

1 Chen, C., Jafari, R. and Kehtarnavaz, N. (2015) Improving human action recognition using fusion of depth camera and inertial sensors. IEEE Trans. Hum. Mach. Syst., 45, 51–61.
2 Chen, C., Kehtarnavaz, N. and Jafari, R. (2014) A medication adherence monitoring system for pill bottles based on a wearable inertial sensor. In 2014 36th Annual Int. Conf. IEEE Engineering in Medicine and Biology Society (EMBC), pp. 4983–4986. IEEE.
3 Chen, C., Liu, K., Jafari, R. and Kehtarnavaz, N. (2014) Home-based senior fitness test measurement system using collaborative inertial and depth sensors. In 2014 36th Annual Int. Conf. IEEE Engineering in Medicine and Biology Society (EMBC), pp. 4135–4138. IEEE.
4 Han, J., Pauwels, E.J., de Zeeuw, P.M. and de With, P.H. (2012) Employing a RGB-D sensor for real-time tracking of humans across multiple re-entries in a smart environment. IEEE Trans. Consumer Electron., 58, 255–263.
5 Han, J., Shao, L., Xu, D. and Shotton, J. (2013) Enhanced computer vision with Microsoft Kinect sensor: a review. IEEE Trans. Cybernet., 43, 1318–1334.
6 Liu, L. and Shao, L. (2013) Learning discriminative representations from RGB-D video data. In Proc. 23rd Int. Joint Conf. Artificial Intelligence, Beijing, China, pp. 1493–1500. AAAI.
7 Yu, M.,
Liu, L. and Shao, L. (2016) Structure-preserving binary representations for RGB-D action recognition. IEEE Trans. Pattern Anal. Mach. Intell., 38, 1651–1664.
8 Adhikari, K., Bouchachia, H. and Nait-Charif, H. (2017) Activity recognition for indoor fall detection using convolutional neural network. In 2017 15th IAPR Int. Conf. Machine Vision Applications (MVA), pp. 81–84. IEEE.
9 Chen, C., Zhang, B., Su, H., Li, W. and Wang, L. (2016) Land-use scene classification using multi-scale completed local binary patterns. Signal Image Video P., 10, 745–752.
10 Zhu, H.-M. and Pun, C.-M. (2013) Human action recognition with skeletal information from depth camera. In 2013 IEEE Int. Conf. Information and Automation (ICIA), pp. 1082–1085. IEEE.
11 Yang, X. and Tian, Y. (2014) Super normal vector for activity recognition using depth sequences. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 804–811. IEEE.
12 Chen, C., Zhou, L., Guo, J., Li, W., Su, H. and Guo, F. (2015) Gabor-filtering-based completed local binary patterns for land-use scene classification. In 2015 IEEE Int. Conf. Multimedia Big Data, pp. 324–329. IEEE.
13 Shotton, J., Fitzgibbon, A., Cook, M., Sharp, T., Finocchio, M., Moore, R., Kipman, A. and Blake, A. (2011) Real-time human pose recognition in parts from single depth images. In 2011 IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pp. 1297–1304. IEEE.
14 Li, W., Zhang, Z. and Liu, Z. (2010) Action recognition based on a bag of 3D points. In 2010 IEEE Computer Society Conf. Computer Vision and Pattern Recognition Workshops, pp. 9–14. IEEE.
15 Chen, C., Jafari, R. and Kehtarnavaz, N. (2015) UTD-MHAD: a multimodal dataset for human action recognition utilizing a depth camera and a wearable inertial sensor. In 2015 IEEE Int. Conf. Image Processing (ICIP), pp. 168–172. IEEE.
16 Lin, Y.-C., Hu, M.-C., Cheng, W.-H., Hsieh, Y.-H. and Chen, H.-M. (2012) Human action recognition and retrieval using sole depth information. In Proc. 20th ACM Int. Conf. Multimedia, pp. 1053–1056. ACM.
17 Chen, L., Wei, H. and Ferryman, J. (2013) A survey of human motion analysis using depth imagery. Pattern Recogn. Lett., 34, 1995–2006.
18 Vieira, A.W., Nascimento, E.R., Oliveira, G.L., Liu, Z. and Campos, M.F. (2012) STOP: Space-time occupancy patterns for 3D action recognition from depth map sequences. In Iberoamerican Congress on Pattern Recognition, pp. 252–259. Springer.
19 Wang, J., Liu, Z., Chorowski, J., Chen, Z. and Wu, Y. (2012) Robust 3D action recognition with random occupancy patterns. In Computer Vision–ECCV 2012, pp. 872–885. Springer.
20 Bobick, A.F. and Davis, J.W. (2001) The recognition of human movement using temporal templates. IEEE Trans. Pattern Anal. Mach. Intell., 23, 257–267.
21 Yang, X., Zhang, C. and Tian, Y. (2012) Recognizing actions using depth motion maps-based histograms of oriented gradients. In Proc. 20th ACM Int. Conf. Multimedia, pp. 1057–1060. ACM.
22 Chen, C., Liu, K. and Kehtarnavaz, N. (2016) Real-time human action recognition based on depth motion maps. J. Real-Time Image Pr., 12, 155–163.
23 Xia, L. and Aggarwal, J. (2013) Spatio-temporal depth cuboid similarity feature for activity recognition using depth camera. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 2834–2841. IEEE.
24 Oreifej, O. and Liu, Z. (2013) HON4D: Histogram of oriented 4D normals for activity recognition from depth sequences. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 716–723. IEEE.
25 Ojala, T., Pietikäinen, M. and Mäenpää, T. (2002) Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans. Pattern Anal. Mach. Intell., 24, 971–987.
26 Chen, C., Jafari, R. and Kehtarnavaz, N. (2015) Action recognition from depth sequences using depth motion maps-based local binary patterns. In 2015 IEEE Winter Conf. Applications of Computer Vision (WACV), pp. 1092–1099. IEEE.
27 Bulbul, M.F., Jiang, Y. and Ma, J. (2015) Human action recognition based on DMMs, HOGs and Contourlet transform. In 2015 IEEE Int. Conf. Multimedia Big Data, pp. 389–394. IEEE.
28 Bulbul, M.F., Jiang, Y. and Ma, J. (2015) Real-time human action recognition using DMMs-based LBP and EOH features. In Int. Conf. Intelligent Computing, pp. 271–282. Springer.
29 Bulbul, M.F., Jiang, Y. and Ma, J. (2015) DMMs-based multiple features fusion for human action recognition. Int. J. Multimed. Data Eng. Manag., 6, 23–39.
30 Liu, H., Tian, L., Liu, M. and Tang, H. (2015) SDM-BSM: A fusing depth scheme for human action recognition. In 2015 IEEE Int. Conf. Image Processing (ICIP), pp. 4674–4678. IEEE.
31 Rahmani, H., Huynh, D.Q., Mahmood, A. and Mian, A. (2016) Discriminative human action classification using locality-constrained linear coding. Pattern Recogn. Lett., 72, 62–71.
32 Kong, Y., Satarboroujeni, B. and Fu, Y. (2015) Hierarchical 3D kernel descriptors for action recognition using depth sequences. In 2015 11th IEEE Int. Conf. Workshops on Automatic Face and Gesture Recognition (FG), pp. 1–6. IEEE.
33 Chen, C., Liu, M., Zhang, B., Han, J., Jiang, J. and Liu, H. (2016) 3D action recognition using multi-temporal depth motion maps and Fisher vector. In Proc. Int. Joint Conf. Artificial Intelligence, pp. 3331–3337. AAAI.
34 Chen, C., Hou, Z., Zhang, B., Jiang, J. and Yang, Y. (2015) Gradient local auto-correlations and extreme learning machine for depth-based activity recognition. In Int. Symposium on Visual Computing, pp. 613–623. Springer.
35 Chen, C., Zhang, B., Hou, Z., Jiang, J., Liu, M. and Yang, Y. (2017) Action recognition from depth sequences using weighted fusion of 2D and 3D auto-correlation of gradients features. Multimed. Tools Appl., 76, 4651–4669.
36 Zhang, B., Yang, Y., Chen, C., Yang, L., Han, J. and Shao, L. (2017) Action recognition using 3D histograms of texture and a multi-class boosting classifier. IEEE Trans. Image Process., 26, 4648–4660.
37 Yacoob, Y. and Black, M.J. (1999) Parameterized modeling and recognition of activities. Comput. Vision Image Understand., 73, 232–247.
38 Lv, F. and Nevatia, R. (2006) Recognition and segmentation of 3-D human action using HMM and multi-class AdaBoost. In European Conf. Computer Vision, pp. 359–372. Springer.
39 Xia, L., Chen, C.-C. and Aggarwal, J.K. (2012) View invariant human action recognition using histograms of 3D joints. In 2012 IEEE Computer Society Conf. Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 20–27. IEEE.
40 Azary, S. and Savakis, A. (2012) 3D action classification using sparse spatio-temporal feature representations. In Int. Symposium on Visual Computing, pp. 166–175. Springer.
41 Yang, X. and Tian, Y.L. (2012) EigenJoints-based action recognition using Naive-Bayes-Nearest-Neighbor. In 2012 IEEE Computer Society Conf. Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 14–19. IEEE.
42 Wang, J., Liu, Z., Wu, Y. and Yuan, J. (2012) Mining actionlet ensemble for action recognition with depth cameras. In 2012 IEEE Conf. Computer Vision and Pattern Recognition, pp. 1290–1297. IEEE.
43 Chaudhry, R., Ofli, F., Kurillo, G., Bajcsy, R. and Vidal, R. (2013) Bio-inspired dynamic 3D discriminative skeletal features for human action recognition. In Proc. IEEE Conf. Computer Vision and Pattern Recognition Workshops, pp. 471–478. IEEE.
44 Zanfir, M., Leordeanu, M. and Sminchisescu, C. (2013) The moving pose: An efficient 3D kinematics descriptor for low-latency action recognition and detection. In Proc. IEEE Int. Conf. Computer Vision, pp. 2752–2759. IEEE.
45 Hussein, M.E., Torki, M., Gowayyed, M.A. and El-Saban, M. (2013) Human action recognition using a temporal hierarchy of covariance descriptors on 3D joint locations. In Proc. 23rd Int. Joint Conf. Artificial Intelligence, pp. 2466–2472. AAAI.
46 Chaaraoui, A.A., Padilla-López, J.R., Climent-Pérez, P. and Flórez-Revuelta, F. (2014) Evolutionary joint selection to improve human action recognition with RGB-D devices. Expert Syst. Appl., 41, 786–794.
47 Vemulapalli, R., Arrate, F. and Chellappa, R. (2014) Human action recognition by representing 3D skeletons as points in a Lie group. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 588–595. IEEE.
48 Gao, Z., Zhang, H., Xu, G. and Xue, Y. (2015) Multi-perspective and multi-modality joint representation and recognition model for 3D action recognition. Neurocomputing, 151, 554–564.
49 Zhang, H. and Parker, L.E. (2011) 4-dimensional local spatio-temporal features for human activity recognition. In 2011 IEEE/RSJ Int. Conf. Intelligent Robots and Systems, pp. 2044–2049. IEEE.
50 Luo, J., Wang, W. and Qi, H. (2014) Spatio-temporal feature extraction and representation for RGB-D human action recognition. Pattern Recogn. Lett., 50, 139–148.
51 Rahmani, H., Mahmood, A., Huynh, D.Q. and Mian, A. (2014) Real time action recognition using histograms of depth gradients and random decision forests. In IEEE Winter Conf. Applications of Computer Vision, pp. 626–633. IEEE.
52 Sung, J., Ponce, C., Selman, B. and Saxena, A. (2012) Unstructured human activity detection from RGBD images. In 2012 IEEE Int. Conf. Robotics and Automation, pp. 842–849. IEEE.
53 Ali, H., Tran, S.N., Benetos, E. and d'Avila Garcez, A.S. (2018) Speaker recognition with hybrid features from a deep belief network. Neural Comput. Applic., 29, 13–19.
54 Iqbal, T. and Ali, H. (2018) Generative adversarial network for medical images (MI-GAN). J. Med. Syst., 42, 231.
55 Wang, L., Zhang, B. and Yang, W. (2015) Boosting-like deep convolutional network for pedestrian detection. In Chinese Conf. Biometric Recognition, pp. 581–588. Springer.
56 Yang, R. and Yang, R. (2014) DMM-pyramid based deep architectures for action recognition with depth cameras. In Asian Conf. Computer Vision, pp. 37–49. Springer.
57 Keçeli, A.S., Kaya, A. and Can, A.B. (2018) Combining 2D and 3D deep models for action recognition with depth information. Signal Image Video P., 12, 1197–1205.
58 Azad, R., Asadi-Aghbolaghi, M., Kasaei, S. and Escalera, S. (2018) Dynamic 3D hand gesture recognition by learning weighted depth motion maps. IEEE Trans. Circuits Syst. Video Technol., 29, 1729–1740.
59 Zhang, C., Tian, Y., Guo, X. and Liu, J. (2018) DAAL: Deep activation-based attribute learning for action recognition in depth videos. Comput. Vision Image Understand., 167, 37–49.
60 Liang, B. and Zheng, L. (2013) Three dimensional motion trail model for gesture recognition. In Proc. IEEE Int. Conf. Computer Vision Workshops, pp. 684–691. IEEE.
61 Ojala, T., Pietikäinen, M. and Harwood, D. (1994) Performance evaluation of texture measures with classification based on Kullback discrimination of distributions. In Proc. 12th Int. Conf. Pattern Recognition, pp. 582–585. IEEE.
62 Ojala, T., Pietikäinen, M. and Harwood, D. (1996) A comparative study of texture measures with classification based on featured distributions. Pattern Recog., 29, 51–59.
63 Huang, G.-B., Zhou, H., Ding, X. and Zhang, R. (2012) Extreme learning machine for regression and multiclass classification. IEEE Trans. Syst. Man Cybern. Part B (Cybernetics), 42, 513–529.
64 Huang, G.-B., Zhu, Q.-Y. and Siew, C.-K. (2006) Extreme learning machine: theory and applications. Neurocomputing, 70, 489–501.
65 Platt, J. et al. (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Adv. Large Margin Classif., 10, 61–74.
66 Wright, J., Ma, Y., Mairal, J., Sapiro, G., Huang, T.S. and Yan, S. (2010) Sparse representation for computer vision and pattern recognition. Proc. IEEE, 98, 1031–1044.
67 Tikhonov, A.N. and Arsenin, V.I. (1977) Solutions of Ill-Posed Problems. Vh Winston.
68 Chen, C., Tramel, E.W. and Fowler, J.E. (2011) Compressed-sensing recovery of images and video using multihypothesis predictions. In 2011 Conf. Record of the Forty Fifth Asilomar Conf. Signals, Systems and Computers (ASILOMAR), pp. 1193–1198. IEEE.
69 Chen, C., Li, W., Tramel, E.W. and Fowler, J.E. (2014) Reconstruction of hyperspectral imagery from random projections using multihypothesis prediction. IEEE Trans. Geosci. Remote Sens., 52, 365–374.
70 Chen, C. and Fowler, J.E. (2012) Single-image super-resolution using multihypothesis prediction. In 2012 Conf. Record of the Forty Sixth Asilomar Conf. Signals, Systems and Computers (ASILOMAR), pp. 608–612. IEEE.
71 Golub, G.H., Hansen, P.C. and O'Leary, D.P. (1999) Tikhonov regularization and total least squares. SIAM J. Matrix Anal. Appl., 21, 185–194.
72 Wright, J., Yang, A.Y., Ganesh, A., Sastry, S.S. and Ma, Y. (2009) Robust face recognition via sparse representation. IEEE Trans. Pattern Anal. Mach. Intell., 31, 210–227.
73 Benediktsson, J.A. and Sveinsson, J.R. (2003) Multisource remote sensing data classification based on consensus and pruning. IEEE Trans. Geosci. Remote Sens., 41, 932–936.
74 Chang, C.-C. and Lin, C.-J. (2011) LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Tech., 2, 27.
75 Padilla-López, J.R., Chaaraoui, A.A. and Flórez-Revuelta, F. (2014) A discussion on the validation tests employed to compare human action recognition methods using the MSR Action3D dataset. CoRR, abs/1407.7390, http://arxiv.org/abs/1407.7390.
76 Liang, C., Chen, E., Qi, L. and Guan, L. (2016) 3D action recognition using depth-based feature and locality-constrained affine subspace coding. In 2016 IEEE Int. Symposium on Multimedia (ISM), pp. 261–266. IEEE.
77 Liu, J., Shahroudy, A., Xu, D. and Wang, G. (2016) Spatio-temporal LSTM with trust gates for 3D human action recognition. In European Conf. Computer Vision, pp. 816–833. Springer.
78 Yang, X. and Tian, Y. (2017) Super normal vector for human activity recognition with depth cameras. IEEE Trans. Pattern Anal. Mach. Intell., 39, 1028–1039.
79 Liu, J., Shahroudy, A., Xu, D., Kot, A.C. and Wang, G. (2018) Skeleton-based action recognition using spatio-temporal LSTM network with trust gates. IEEE Trans. Pattern Anal. Mach. Intell., 40, 3007–3021.
80 Weng, J., Weng, C. and Yuan, J. (2017) Spatio-temporal Naive-Bayes Nearest-Neighbor (ST-NBNN) for skeleton-based action recognition. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 4171–4180. IEEE.
81 Asadi-Aghbolaghi, M. and Kasaei, S. (2018) Supervised spatio-temporal kernel descriptor for human action recognition from RGB-depth videos. Multimed. Tools Appl., 77, 14115–14135.
82 Nguyen, X.S., Mouaddib, A.-I., Nguyen, T.P. and Jeanpierre, L. (2018) Action recognition in depth videos using hierarchical Gaussian descriptor. Multimed. Tools Appl., 77, 21617–21652.
83 Bulbul, M.F., Islam, S. and Ali, H. (2019) Human action recognition using MHI and SHI based GLAC features and collaborative representation classifier. J. Intell. Fuzzy Syst., 36, 3385–3401.
84 Wang, P., Li, W., Li, C. and Hou, Y. (2018) Action recognition based on joint trajectory maps with convolutional neural networks. Knowledge-Based Syst., 158, 43–53.
85 McNally, W., Wong, A. and McPhee, J. (2019) STAR-Net: Action recognition using spatio-temporal activation reprojection. CoRR, abs/1902.10024, pp. 1–8, http://arxiv.org/abs/1902.10024.
86 Gorelick, L., Blank, M., Shechtman, E., Irani, M. and Basri, R. (2007) Actions as space-time shapes. IEEE Trans. Pattern Anal. Mach. Intell., 29, 2247–2253.
87 Gorelick, L., Blank, M., Shechtman, E., Irani, M. and Basri, R. (2007) Actions as space-time shapes. IEEE Trans. Pattern Anal. Mach. Intell., 29, 2247–2253.
88 Iosifidis, A., Tefas, A. and Pitas, I. (2015) On the kernel extreme learning machine classifier. Pattern Recog. Lett., 54, 11–17.

© The Author(s) 2019. Published by Oxford University Press on behalf of The British Computer Society. All rights reserved.