TY - JOUR
AU - Zaied, Mourad
AB - Abstract Emotion recognition is a key research area in brain-computer interaction. With the growing interest in affective computing, emotion recognition has attracted more and more attention over the past decades. Accurately locating the geometric positions of the key parts of the face is an effective way to increase the accuracy of emotion recognition systems and to reach high classification rates. In this paper, we propose a hybrid system based on wavelet networks using the 1D Fast Wavelet Transform. This system combines two approaches: the biometric distances approach, in which we propose a new technique to locate feature points, and the wrinkles approach, in which we propose a new method to locate the wrinkle regions of the face. The classification rates given by the experimental results show the effectiveness of the proposed approach compared with other methods.

1. INTRODUCTION
Emotion recognition [1–4] is a relevant technique for analyzing and understanding human behavior. It also has many important applications in various fields such as virtual reality, social advertisement, human-robot interaction and data-driven animation. The main goal of emotion recognition is to identify the human emotional state (e.g. anger, joy, surprise, disgust, fear and sadness) from the given data. It should be pointed out that automatically recognizing human emotions with high accuracy is a challenging task. Over the past two decades, a variety of emotion recognition methods have been proposed in the literature, among which we cite emotion recognition by body gesture analysis [5]. This approach [5] analyzes the actions of a person in order to determine the emotional state. The analysis is based on the position and the movements of the upper part of the body (hands and head) and is restricted to the trajectory of points and their velocity. These characteristic points form a triangle whose perimeter gives information on the qualitative aspects of their movement, following the approach proposed by Glowinski and Camurri [5]. Starting from low-level physical measures of the body, they identified a high-level model that provides a set of qualitative aspects of the movement. In other words, the transformation of the low-level model, which corresponds to the physical measures (the position, velocity and acceleration of the upper body parts), into a high-level model (such as righteousness, impulsiveness and fluency) provides the qualitative aspects that in turn allow a particular emotion to be recognized. We can also mention the emotion recognition methods based on analyzing words [6, 7]. Thanks to automatic speech emotion recognition systems, the machine becomes able to transform a signal into a sequence of words. But we must go further, learn the meaning of the word sequences and know the context in which the sentences are uttered. It is at this level that the emotional dimension is involved. So, we must take into account the intonation of the sentence in order to distinguish between a statement and a question. In addition, we mention methods using the electroencephalogram (EEG) [8–12]. In EEG-based emotion recognition, EEG signal preprocessing is conducted in several steps, including channel, time segment and frequency band selection. A number of studies have worked on identifying features of EEG signals for emotion recognition. Jenke et al. [8] consider three domains of signal feature extraction methods: time, frequency and time-frequency.
Several recent studies on emotion recognition from EEG signals consider that time-domain features such as the fractal dimension, Hjorth parameters and statistical features are useful for identifying emotion characteristics from time-series EEG signals [13]. Among frequency-domain feature extraction methods, the power features calculated by the Fast Fourier Transform (FFT) have been widely used in EEG-based emotion studies [13]. In addition to these approaches, many methods are based on facial expression analysis [14–17]. Facial expression is a visible manifestation of the state of mind, cognitive activities, physiological activities (tiredness, pain), character and psychopathology of a person. Psychology research has shown that facial expressions play an important role in coordinating human conversation and have a greater impact on the listener than the textual content of the expressed message, the face being the most expressive part of the body [18]. This paper aims at developing a facial emotion recognition system. It presents a hybrid emotion recognition system based on the analysis of the shapes of the wrinkles and of the biometric distances. This paper presents the following contributions:
A new method to automatically locate the feature points on the face.
A new method to automatically locate the wrinkle regions on the face.
The rest of the paper is organized as follows. Section 2 introduces the related work. The proposed methods for both the biometric distances and the shapes of the wrinkles are described in Section 3. Section 4 describes the classification stage, and Section 5 provides a comparison between our method and other approaches in the literature to demonstrate the advantages of the proposed methods. The final section concludes this study.

2. RELATED WORK
The purpose of this section is to give a general idea of some recent works based on facial expression analysis. Existing facial expression recognition methods can mainly be categorized into two classes: template-based approaches and feature-based approaches [4, 19].

2.1. Template-based approach
These approaches use a holistic or a hybrid face representation and apply a template-based method to extract facial expression information from an input image sequence. Lyons et al. [20] use a labeled graph to represent the face. Each node of the graph consists of an array called a jet. Each component of a jet is the response of a certain Gabor wavelet filter [21] extracted at a point of the input image. Hong et al. use wavelets of five different frequencies and eight different orientations. They defined two labeled graphs, named general face knowledge (GFK). A big GFK is a labeled graph with 50 nodes, where each node is assigned a 40-component jet of the corresponding landmark extracted from 25 individual faces. A small GFK is a labeled graph with 16 nodes; each node has a 12-component jet (four wave field orientations and three frequencies) extracted from a set of eight faces. The small GFK is used to find the face location in an input facial image, while the big GFK is used to locate the facial features. Wiskott utilizes the Person Spotter system [22] and the method of elastic graph matching proposed by Pentland et al. [23] to fit the model graph to a face image. First, the small GFK is moved and scaled over the input image until the place of best match is found.
After the matching is performed, the exact face position is derived from the canonical graph size value (the mean Euclidean distance of all nodes from the center of gravity). Then, the big GFK is fitted to the cropped face region and a node-weighting method is applied: a low weight is assigned to the nodes on the face and hair boundary and a high weight is assigned to the nodes on the facial features. Essa and Pentland [24] used a hybrid approach to face representation. At first, they used the Eigenspace method in order to track the face in the scene and extract the locations of the eyes, nose and mouth. The technique for extracting the prominent facial features employs eigenfeatures approximated on a sample of 128 images. The eigenfeatures define a so-called feature space. To detect the location of the prominent facial features in an image, the distance of each feature image from the corresponding feature space is computed. The position of the prominent facial features is further used to normalize the input image. A 2D spatiotemporal motion energy representation of facial motion, estimated from two consecutive normalized frames, is used as a dynamic face model. In fact, the Eigenspace method approximates the face vectors (face images) by lower-dimensional feature vectors. It involves an off-line (training) phase, in which the projection matrix, the one that achieves the dimensionality reduction, is obtained using all the database face images. In the off-line phase, the mean face and the reduced representation of each database image are also computed. The recognition process works as follows. A preprocessing module transforms the face image into a unitary vector and then subtracts the mean face. The resulting vector is projected using the projection matrix obtained by the Eigenspace method; this projection corresponds to a dimensionality reduction of the input. Then, the similarity between the projected vector and each of the reduced vectors is computed using a certain similarity criterion (the Euclidean distance, for example). The class of the most similar vector is the result of the recognition process. In addition, a rejection mechanism for unknown faces is used if the similarity measure is not good enough.
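The Eigenspace recognition pipeline summarized above (projection onto a low-dimensional feature space, mean-face subtraction, nearest-neighbor matching with an optional rejection threshold) can be sketched as follows. This is only a generic NumPy illustration of that pipeline, not the implementation of [24]; the function names, the number of components k and the rejection threshold are illustrative assumptions.

```python
import numpy as np

def train_eigenspace(faces, k=20):
    """Offline phase: faces is an (N, D) array of flattened training face images.
    Returns the mean face, the projection matrix and the reduced training vectors."""
    mean_face = faces.mean(axis=0)
    centered = faces - mean_face
    # Principal components via SVD; the rows of Vt play the role of eigenfaces.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    projection = Vt[:k]                 # (k, D) projection matrix
    reduced = centered @ projection.T   # (N, k) reduced representations
    return mean_face, projection, reduced

def recognize(query, mean_face, projection, reduced, labels, reject_threshold=None):
    """Online phase: project the query face and return the label of the nearest
    reduced training vector, or None if the match is not good enough."""
    q = (query - mean_face) @ projection.T
    dists = np.linalg.norm(reduced - q, axis=1)   # Euclidean similarity criterion
    best = int(np.argmin(dists))
    if reject_threshold is not None and dists[best] > reject_threshold:
        return None                                # rejection of unknown faces
    return labels[best]
```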
2.2. Feature-based approach
The feature-based approach aims at extracting information from geometric and appearance features [25, 26]. Geometric features provide the locations and shapes of facial components, while appearance features express facial appearance changes such as furrows, gapes, wrinkles and bulges. Many techniques based on geometric and appearance features can be found in the literature. In [27, 28], the authors proposed a geometric face model. Their real-time system works with images of subjects with no facial hair or glasses, taken online while the subject faces the camera and sits approximately 1 m away from it. In [29], the authors use a point-based model composed of two 2D facial views, the frontal and the side view. The frontal-view face model contains 30 features; of these, 25 are defined in correspondence with a set of 19 facial points and the rest are some specific shapes of the mouth and chin. The side-view face model consists of 10 profile points, which correspond to the peaks and valleys of the curvature of the profile contour function.

To localize the contours of the prominent facial features and then extract the model features in a dual-view input, Pantic and Rothkrantz apply multiple feature detectors for each prominent facial feature (eyebrows, eyes, nose, mouth and profile). In order to locate the eyes, they use the methods proposed in [30] and [31] with [32]. Then, the best of the acquired (redundant) results is chosen. All this work relies both on the knowledge of facial anatomy (used to check the correctness of the result of a certain detector) and on the confidence in the performance of a specific detector (assigned to it based on its testing results). The performance of the detection scheme was tested on 496 dual views, and the facial feature detection rate reached 89%. Holt et al. [33] also utilize a point-based frontal-view face model but do not deal with automatic facial expression data extraction. They use 10 facial distances, measured manually on 94 images chosen from the facial emotion database assembled by Ekman and Friesen [34, 35]. This information is subsequently used for classification. In our work, we focus on this approach, which extracts data from geometric and appearance features, thanks to its simplicity and accuracy.

3. PROPOSED APPROACH
The proposed system is the combination of two approaches: the first one is the wrinkles approach, the second is an approach based on the biometric distances. The proposed system is illustrated in Fig. 1. As shown in this figure, the system is divided into two approaches, from each of which we extract Euclidean distances in order to classify emotions. To extract information using the wrinkles approach, we start with the detection of the face's elements, followed by the location of the wrinkle regions of the face, for which we propose a new technique, and then the information extraction. In parallel, we extract information using the biometric distances approach following these steps: detection of the face's elements, location of the feature points, for which we propose a new technique, and tracking of the characteristic points.

FIGURE 1. Overview of the proposed system.

3.1. Biometric distances approach
This approach contains three stages. To achieve the first step, the detection of the face's elements, we use the Viola and Jones detector based on Haar-like features. Then, we propose an automatic and simple method to locate the characteristic points, which we call the five rectangles method, and we use the optical flow to track them. After the detection of the face's elements, the system locates 38 points and computes 21 Euclidean distances in the neutral state (the state of a person before expressing an emotion). Then, it tracks the feature points during the emotion and computes the new distances; the features are the variations between the distances in the neutral state and those calculated during the emotion, as sketched below.
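As a minimal sketch of this feature computation: assuming each landmark is available as an (x, y) coordinate, the feature vector is one distance variation per point pair. The pairs listed in DISTANCE_PAIRS below are a hypothetical subset; the actual 21 pairs are defined by the 38-point model described in Section 3.1.2.

```python
import numpy as np

# Hypothetical subset of the 21 point pairs; the full pairing is defined by the
# 38-point model of Section 3.1.2 and is not reproduced here.
DISTANCE_PAIRS = [(7, 27), (8, 28), (30, 35), (32, 38)]

def biometric_distances(points, pairs=DISTANCE_PAIRS):
    """points maps a point index to its (x, y) position; returns one Euclidean
    distance per pair."""
    return np.array([np.hypot(points[a][0] - points[b][0],
                              points[a][1] - points[b][1]) for a, b in pairs])

def distance_variations(neutral_points, emotion_points, pairs=DISTANCE_PAIRS):
    """Feature vector: distance during the emotion minus distance in the neutral
    state, one value per pair."""
    return (biometric_distances(emotion_points, pairs)
            - biometric_distances(neutral_points, pairs))
```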
3.1.1. Step 1: Detection of the face's elements
Face detection is a topic of growing interest, and numerous methods have emerged in the last two decades. Let us start with the methods based on the extraction of invariant characteristic parameters [36]. The algorithms of these methods are based on the principle that facial features do not change when the face is rotated. The algorithm of Silva et al. [36] is a typical example of such feature-based methods. It scans the top and the bottom of the face in order to detect the eye plane, which is characterized by an increase in edge density. The length between this plane and the top of the face is taken as a reference length, and the algorithm then builds a "template" containing the facial features (mouth, eyes, etc.). The authors report a success rate of 82% in the detection of all the facial characteristics. Moreover, the algorithm is able to detect the characteristics of different ethnicities, but it cannot detect them properly if the image contains glasses or if the forehead is covered. There are also methods based on prior knowledge. These methods are used to localize the face and focus on its characteristic elements (nose, mouth, eyes). Kotropoulos and Pitas [37, 38] succeeded in locating the characteristics of the face using Kanade's [39] projection method after detecting the contours of the face. The weakness of this approach is its sensitivity to the image background, which may contain many objects other than faces. In addition, there are appearance-based methods. These techniques treat face detection as a classification problem (face versus non-face). They give good results, but their computation time remains very high. In 2001, a relevant step was taken with the publication of the Viola and Jones method [40], which can perform effective real-time detection. We therefore use this detector to realize the first stage of our emotion recognition system. Viola and Jones [40] proposed Haar-like features for rapid object detection. A Haar-like feature is composed of several white and black areas. The intensity values of the pixels in the white and black areas are accumulated separately, and the feature value is computed as a weighted combination of these two sums. The detector is trained with the Adaboost algorithm and is finally composed of a cascade of classifiers. The cascade contains several classifiers called weak learners. The principle of this approach is that the majority of search windows do not contain the object of interest [40]. As the Viola and Jones detector contains several classifiers, if a negative sample is classified as a false positive, it will be corrected by the next stage, and the same holds for positive samples classified as false negatives. This kind of cascaded structure can therefore achieve increased detection performance while radically reducing the computation time. This phase is a key part of our emotion recognition system: it provides the locations of the rectangles from which we derive the locations of the characteristic points.
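To illustrate this step concretely, the sketch below runs OpenCV's pretrained Haar cascade detectors for the face and the eyes. It is an assumption-laden sketch of typical cascade-detector usage, not the authors' exact configuration: the cascade files, the scale factor and the neighbor count are illustrative, and the nose and mouth cascades (used in the same way) are not bundled with every OpenCV distribution, so only the face/eye case is shown.

```python
import cv2

# Pretrained Haar cascades bundled with OpenCV; nose and mouth cascades come from
# the contributed "mcs" set and may have to be obtained separately.
face_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
eye_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_eye.xml")

def detect_face_elements(image_bgr):
    """Runs the cascade detectors and returns bounding boxes [x, y, width, height]
    for each face and, inside each face region, for the eyes."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    result = []
    for (x, y, w, h) in faces:
        roi = gray[y:y + h, x:x + w]
        # Eye boxes are returned relative to the upper-left corner of the face box.
        eyes = eye_cascade.detectMultiScale(roi, scaleFactor=1.1, minNeighbors=5)
        result.append({"face": (x, y, w, h), "eyes": eyes})
    return result
```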
3.1.2. Step 2: Location of the feature points
In this step, we propose an automatic technique, which we call the five rectangles method, to locate the feature points, whereas in several other works they were initialized manually on the face [17]. The approach of [18] places the points automatically, but its localization method first requires the location of three principal axes: the eye axis, the mouth axis and the nose axis. The nose axis is defined as the axis that divides the rectangle enclosing the face into two equal parts, but this is not always the case. Indeed, Fig. 2 presents an example of a head in an inclined position, where the axis dividing the rectangle enclosing the face into two equal parts does not correspond to the nose axis.

FIGURE 2. Example of a head in an inclined position.

Moreover, this method requires a lot of computation, whereas we localize these points more easily and faster than the anthropometric model. In contrast to the manual methods, we propose a new, automatic technique to locate the feature points. To find the location of each point on the face of a person, we rely on the bounding boxes (BBOX) returned by the detection of the face's elements. The BBOX output is an M-by-4 matrix defining M bounding boxes around the detected objects; the detection performs multi-scale object detection on the input image, which must be a grayscale or true color (RGB) image. Each row of the output matrix BBOX is a four-element vector, [x y width height], that specifies in pixels the upper left corner and the size of a bounding box, as shown in the following equations:

BBOX_FD = [x_FD y_FD width_FD height_FD] (1)
BBOX_LED = [x_LED y_LED width_LED height_LED] (2)
BBOX_RED = [x_RED y_RED width_RED height_RED] (3)
BBOX_ND = [x_ND y_ND width_ND height_ND] (4)
BBOX_MD = [x_MD y_MD width_MD height_MD] (5)

with: FD: face detection; LED: left eye detection; RED: right eye detection; ND: nose detection; MD: mouth detection.

We will locate 38 landmarks on the face: 18 static points and 20 dynamic points, shown in Fig. 3. We start with the determination of the positions of the static points.

FIGURE 3. Static points and dynamic points.

3.1.3. Determination of the static point positions
To find the positions of P7, P8, P9 and P10, we rely on the positions of the eye rectangles: these points are placed at half the width of the eye rectangles, as shown in Fig. 4.

FIGURE 4. Positions of some static points.

To find the positions of P11 and P12, we rely on the position of the face detection rectangle: these points are placed at half the width of this rectangle. For the rest of the dynamic points, we depend on the face detection rectangle in order to place them.

3.1.4. Determination of the dynamic point positions
To find the positions of P27 and P28, we rely on the position of the nose rectangle: these points are placed exactly at half the width of this rectangle, as shown in Fig. 5. To find the positions of P30, P35, P32 and P38, we rely on the position of the mouth rectangle: P30 and P35 are placed exactly at half the height of the mouth rectangle, while P32 and P38 are placed exactly at half the width of the mouth rectangle. To determine the rest of the dynamic points, we rely on the positions of the eye rectangles. Table 1 presents the coordinates of some feature points.

FIGURE 5. Positions of some dynamic points.
Table 1. Coordinates of feature points.

Point   Coordinates
p7      (x_LED, width_LED/2)
p8      (x_LED + height_LED, width_LED/2)
p9      (x_RED, width_RED/2)
p10     (x_RED + height_RED, width_RED/2)
p11     (x_FD, width_FD/2)
p12     (x_FD + height_FD, width_FD/2)
p27     (x_ND, width_ND/2)
p28     (x_ND + height_ND, width_ND/2)
p30     (x_MD + height_MD/2, y_MD)
p35     (x_MD + height_MD/2, y_MD + width_MD)
p32     (x_MD, width_MD/2)
p38     (x_MD + height_MD, width_MD/2)

3.1.5. Step 3: Tracking the feature points
Tracking an object through a video occupies a prominent place in several areas of computer vision, such as surveillance and robotics. The quality of the tracking and its real-time behavior are the most important requirements of this task, and several methods have been proposed in the literature to meet both. These methods decompose the tracking problem into two parts: the first part is the modeling of the object to be tracked, i.e. the extraction of feature vectors from it; the second part is the tracking of the vector collecting the extracted features. We use a technique based on the optical flow. The optical flow provides information on the movement of each pixel of the image: it measures the displacement vectors from the pixel intensities of two consecutive, or temporally close, images. Inactive pixels have zero velocity, in contrast to pixels belonging to moving objects. To perform this step, we need a set of videos on which to apply the tracking approach. As described above, the system locates the 38 landmarks and computes the 21 Euclidean distances in the neutral state; in this step, it tracks the located landmarks during the emotion and computes the new distances, whose variations with respect to the neutral state are the data extracted from this approach.
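The paper does not name a specific optical-flow algorithm, so the sketch below uses pyramidal Lucas-Kanade optical flow as implemented in OpenCV as one plausible choice; the window size, pyramid level and termination criteria are illustrative assumptions.

```python
import cv2
import numpy as np

LK_PARAMS = dict(winSize=(21, 21), maxLevel=3,
                 criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 30, 0.01))

def track_points(prev_gray, next_gray, points):
    """Tracks the located landmarks from one grayscale frame to the next.
    points: (N, 2) array of (x, y) positions in the previous frame."""
    p0 = np.asarray(points, dtype=np.float32).reshape(-1, 1, 2)
    p1, status, _err = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, p0, None, **LK_PARAMS)
    # status == 1 marks points whose displacement was found; points lost (e.g. under
    # difficult lighting) can be dropped or re-initialized from the detector boxes.
    return p1.reshape(-1, 2), status.ravel().astype(bool)
```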
3.2. Wrinkles approach
This section presents the wrinkles approach and its different parts. This approach is based on the analysis of the wrinkles [4]. The face is the most expressive part of the body. A facial expression is due to the activation of one or more facial muscles, which produces a deformation of the permanent facial features (eyes, eyebrows and mouth) and the appearance of transient traits in particular regions of the face. The most important wrinkle regions of the face are the upper part of the nose, the forehead, the corners of the eyes, the corners of the mouth and the chin region. The proposed approach contains three steps: detection of the face's elements, location of the wrinkle regions of the face, which is where we contribute, and information extraction.

3.2.1. Step 1: Detection of the face's elements
This phase consists in detecting the face and its elements (eyes, nose and mouth). To achieve it, we use the Viola and Jones detector.

3.2.2. Step 2: Location of the wrinkle regions
The objective of this phase is to locate seven boxes: a box on the forehead, another on the chin, two boxes at the corners of the eyes, two boxes at the corners of the mouth and a box on the upper part of the nose. To find the location of these regions on the face of a person, we follow the same principle as for the location of the feature points: we rely on the bounding boxes (BBOX) returned by the detection of the face's elements. To find the positions of the forehead, eye-corner and upper-nose regions, we rely on the positions of the eye rectangles. To find the positions of the chin and mouth-corner regions, we rely on the position of the mouth rectangle. We locate the seven regions on the face in the neutral state, as shown in Fig. 6a, and we relocate them on the face during the emotion, as shown in Fig. 6b. Locating them is very simple because, in the first stage, we have already used the Viola and Jones detector to locate the face, the eyes and the mouth.

FIGURE 6. (a) The wrinkle regions in the neutral state; (b) the wrinkle regions during the emotion.

The detection of the face and its elements facilitates the location of the wrinkle regions, since we know the coordinates of the points characterizing each rectangle as well as its dimensions (length, width, etc.). So, our method is not only simple but also automatic.

3.2.3. Step 3: Information extraction
In this step, we extract information from the detected regions by counting the edge pixels of each facial region during the expression as well as in the neutral state. Then, we compute the difference between the number of edge pixels in the neutral state and the number of edge pixels during the emotion.

4. CLASSIFICATION
Before the classification, we normalize the different data collected from the two approaches.

4.1. Data normalization
The idea is to measure the similarity between the query image descriptor D and the descriptors Di of all the images of the database, with i ∈ [1…n] and n the total number of images in the database. The similarity distances of image i are calculated using the Euclidean distance. If we denote by MinDS and MaxDS the minimum and maximum values of the similarity distances over the n images of the dataset, the normalized value NDS_i is calculated as follows:

NDS_i = (DS_i − MinDS) / (MaxDS − MinDS)
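The sketch below illustrates both the edge-pixel differences of Section 3.2.3 and the min-max normalization above. The paper does not specify which edge detector is used, so the Canny detector and its thresholds are assumptions made for illustration only.

```python
import cv2
import numpy as np

def edge_pixel_count(gray, box):
    """Counts the edge pixels inside one wrinkle region given as (x, y, width, height)."""
    x, y, w, h = box
    edges = cv2.Canny(gray[y:y + h, x:x + w], 100, 200)  # thresholds are illustrative
    return int(np.count_nonzero(edges))

def wrinkle_features(neutral_gray, emotion_gray, boxes):
    """One value per region: edge-pixel count during the emotion minus the neutral count."""
    return np.array([edge_pixel_count(emotion_gray, b) - edge_pixel_count(neutral_gray, b)
                     for b in boxes], dtype=float)

def min_max_normalize(values):
    """Min-max normalization: NDS_i = (DS_i - MinDS) / (MaxDS - MinDS)."""
    values = np.asarray(values, dtype=float)
    vmin, vmax = values.min(), values.max()
    return (values - vmin) / (vmax - vmin) if vmax > vmin else np.zeros_like(values)
```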
The classification is the last stage of our system. In this stage, we use wavelet networks [41–44] based on the Fast Wavelet Transform (FWT) in order to classify the basic emotions (joy, anger, sadness, neutral, disgust, fear, surprise). Wavelet networks were introduced by Zhang and Benveniste [42] in 1992 as a combination of artificial neural networks with radial basis functions and wavelet decomposition. These authors [42] also explained how a wavelet network can be generated: it is defined by weighting a set of wavelets, dilated and translated from one mother wavelet, so as to approximate a given signal f. Equation (1.1) represents the output of the network using a finite number of wavelets:

f̃ = ∑_{i=1}^{n} ω_i ψ_i (1.1)

Many authors [45] used the projection of the signal f on the dual basis of the wavelets and scaling functions of the hidden layer in order to calculate the output weight connections of the wavelet network. This type of technique gives precise weight values, but it has a main drawback when determining the weights from the hidden layer to the output layer, because it requires the inversion of the matrix Φ, which is computationally intensive as the matrix is very large. Our technique is instead based on the FWT. The FWT is an algorithm designed to turn a waveform or signal in the time domain into a sequence of coefficients based on an orthogonal basis of small waves, or wavelets. The transform can easily be extended to multidimensional signals, such as images, where the time domain is replaced by the space domain. The FWT allows the approximation and detail coefficients to be computed with a simple and fast method. The principle of this step is to create a wavelet network that models each learning vector. The vectors contain the data extracted by the two approaches. To create the network of each vector, the stages are as follows: prepare the wavelet and scaling functions, compute the weights by FWT, compute the contributions of the library functions, and choose the best features by setting a stopping criterion. After these steps, we obtain the weight vector corresponding to the best inputs of each learning vector. After the learning phase, the test phase decides which class a test vector corresponds to. After projecting each test vector on the network of every training vector, we obtain the weight vector of this test vector. Then, we calculate the distance between the weight vectors of the training and test vectors, and we sort the obtained distances. Finally, the algorithm assigns to the test vector the class corresponding to the smallest distance. The objective of the classification is to predict the class of new objects, in other words, the class of test objects that were not presented during the learning phase. This phase is the most important part of our work. We made changes in order to enhance the classification rates of facial expressions and to obtain classification rates better than those obtained by Abdat [18]. To achieve this step, we use wavelet networks trained by the 1D FWT [46, 47].

4.2. Fast wavelet transform
The objective of the FWT [48] is to calculate the approximation and detail coefficients using techniques simpler than the methods based on projection on the dual basis. The steps to be followed are:
Prepare the wavelet and scaling functions (the library of activation functions of the network).
Calculate the weights using the FWT.
Compute the contributions of each library function.
Select the features which best approximate the vector at the output of the network by setting a stopping criterion.
Figure 7 presents these different steps. At the end of this process, we get the weight vector corresponding to the best contributions of each learning vector.

FIGURE 7. Training phase based on FWT.
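As a loose sketch of the train/test logic described above, the code below uses PyWavelets' 1D wavedec to compute the coefficients that play the role of the weight vector, and classifies a test vector by the smallest distance to the training weight vectors. It deliberately omits the library of scaling functions, the contribution computation and the stopping criterion of the authors' wavelet-network construction; the choice of wavelet ("db2") and decomposition level are assumptions.

```python
import numpy as np
import pywt

def fwt_weights(vector, wavelet="db2", level=3):
    """1D fast wavelet transform of a feature vector; the approximation and detail
    coefficients are concatenated into a single weight vector."""
    coeffs = pywt.wavedec(np.asarray(vector, dtype=float), wavelet, level=level)
    return np.concatenate(coeffs)

def train(training_vectors):
    """Learning phase: one weight vector per (normalized) learning vector."""
    return [fwt_weights(v) for v in training_vectors]

def classify(test_vector, trained_weights, labels):
    """Test phase: compare the weights of the test vector with every training weight
    vector and return the label associated with the smallest distance."""
    w = fwt_weights(test_vector)
    distances = [np.linalg.norm(w - tw) for tw in trained_weights]
    return labels[int(np.argmin(distances))]
```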
5. RESULTS AND DISCUSSION
In this section, we first present the datasets used to evaluate the performance of our proposed methods. Then, the obtained results are analyzed and compared with other methods. We used the Cohn-Kanade dataset, a Carnegie Mellon University database [49, 50]. It contains a set of grayscale facial expression images of men and women of different ethnicities. The size of each image is 640 by 490 pixels. The camera orientation is frontal and small movements of the head are present. This dataset is frequently used for facial expression recognition. Figure 8 presents some samples of the database.

FIGURE 8. Samples of the Cohn-Kanade dataset.

We used a second dataset named the Japanese Female Facial Expression (JAFFE) database [51]. The JAFFE dataset has 213 images of seven facial expressions posed by 10 Japanese female models. Each image has been rated on six emotion adjectives by 60 Japanese subjects. The dataset was planned and assembled by Michael Lyons, Miyuki Kamachi and Jiro Gyoba [52]. Figure 9 presents some samples of the database.

FIGURE 9. Samples of the JAFFE dataset.

5.1. Results of the biometric distances approach
5.1.1. Face elements' detection
To evaluate the first stage, we use the two datasets, Cohn-Kanade and JAFFE. Our method for the detection of the face elements has been successfully tested on these datasets. The result is displayed as five blue rectangles around the face, eyes, nose and mouth, as shown in Fig. 10. These rectangles are built from the bounding boxes. We note that some face elements are not surrounded by detection rectangles, which is the limitation of this method.

FIGURE 10. Face elements' detection.

5.1.2. Locating the feature points
The results obtained with the five rectangles method are very satisfactory. Our technique locates 38 points on the face, as shown in Fig. 11. The location of the feature points is done correctly in the majority of the images, except for some particular persons who have very big noses or mouths.

FIGURE 11. Location of the feature points.

5.1.3. Tracking the feature points
To evaluate the tracking phase, a set of videos was created from the Cohn-Kanade dataset. In Fig. 12, we can see the tracking of the characteristic points using the optical flow. The tracking is done correctly in the majority of the images: it correctly follows the points. However, we notice that the algorithm is easily affected by the lighting of the image, which explains the loss of a few points during this phase.

FIGURE 12. Tracking the feature points.
5.1.4. Results of the wrinkles approach
The experimental results obtained by the wrinkles approach are presented and discussed in this section. We used the two datasets, Cohn-Kanade and JAFFE, which contain the seven emotional facial expressions (joy, disgust, neutral, sadness, fear, surprise, anger). The rates are shown in Fig. 13. The FWT correctly classified the neutral emotion with a rate equal to 100%: the wrinkle regions are not modified for this emotion, so the vector of pixel differences is null and there is no confusion for this class. The system classified the joy and disgust classes with moderately high rates. The rest of the classes are classified with low rates compared with the other classes: first, because we have a detection problem in the wrinkle regions; second, because the datasets contain images of persons who do not express these emotions in the same manner. For example, some persons express the sadness emotion with curved eyebrows and a tight mouth, whereas others express this emotion with relaxed eyebrows and a tight mouth.

FIGURE 13. Classification rates on the Cohn-Kanade and JAFFE datasets with FWT.

That is why we proposed a hybrid system to improve these rates.

5.2. Results of the hybrid system
In order to validate the presented method, a number of experiments were carried out. We used two types of features: the features extracted from the wrinkle regions and the variations of the biometric distances during the emotion. The features extracted from each approach were normalized and combined in order to distinguish the seven expression classes (a sketch of this feature fusion is given at the end of this subsection). From Figs 14 and 15, we can see that when we combine these two types of features, the FWT achieves the best performance on the seven expression classes compared with the wrinkles approach alone as well as with the approach of [4]. Ghanem and Caplier [4] proposed an emotion recognition system based on the facial deformations of permanent and transient features; this system recognizes a considered expression and quantifies it using the Transferable Belief Model (TBM).

FIGURE 14. Comparison of the classification rates of the hybrid system and the wrinkles approach.

FIGURE 15. Comparison of the classification rates of the hybrid system and another approach.

We can also observe that the expressions of neutrality, joy, anger, surprise and fear achieve excellent performance owing to their distinctive features in the key regions of the face, while the expressions of disgust and sadness provide satisfactory results. The accuracy percentages of the disgust and sadness expressions are slightly poorer than those of the other expressions, because they only show subtle nuances in both shape and appearance features. In general terms, expressions are easily confused due to the similarity of their shape and appearance features and to the individual variations for the same expression. We notice that our hybrid system is more robust and performs better than the wrinkles approach.
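The paper does not detail the fusion beyond normalizing and combining the two feature sets, so the sketch below assumes a simple concatenation of the two min-max-normalized vectors before they are handed to the FWT-based classifier; this is an illustrative assumption, not the authors' exact scheme.

```python
import numpy as np

def hybrid_feature_vector(distance_variations, wrinkle_differences):
    """Normalizes each feature set separately (min-max) and concatenates them into
    the vector handed to the FWT-based classifier."""
    def min_max(v):
        v = np.asarray(v, dtype=float)
        return (v - v.min()) / (v.max() - v.min()) if v.max() > v.min() else np.zeros_like(v)
    return np.concatenate([min_max(distance_variations), min_max(wrinkle_differences)])
```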
Finally, to validate the performance of the proposed facial expression recognition approach, a comparison of 7-class facial expression recognition performance between our method and four other approaches, described in more detail in [4, 53–55], is reported in Table 2.

Table 2. Comparison of accuracy between our method and other methods.

Method                      Overall accuracy on the JAFFE dataset (%)
Ghanem and Caplier [4]      52
P. Khorrami et al. [53]     82.43
V. Chernykh et al. [54]     73
Y. Fan et al. [55]          79.16
Our method                  94.4

As observed in Table 2, the proposed method clearly outperforms the four other approaches presented in [4, 53–55], by 42.4, 11.97, 21.4 and 15.24 percentage points, respectively. Based on this analysis, the good performance of our method is due not only to the combination of two approaches, which provides abundant information for the classifier to discriminate different expressions, but also to the FWT, which acts as a robust classifier thanks to its powerful feature learning ability.

6. CONCLUSION
In this paper, we presented a hybrid emotion recognition system based on two approaches: the biometric distances approach and the wrinkles approach. The first approach contains three steps: face elements detection, location of the feature points and tracking of the feature points. The second approach also consists of three steps: face elements detection, location of the wrinkle regions and information extraction. We combine the information extracted by the two approaches and then classify it using the fast wavelet transform. We contribute at two levels: the location of the feature points and the location of the wrinkle regions. The results obtained are satisfactory and confirm the robustness of the proposed approach.

FUNDING
The authors would like to acknowledge the financial support of this work by grants from the General Direction of Scientific Research (DGRST), Tunisia, under the ARUB program.

REFERENCES
1 Sokolov, D. and Patkin, M. (2018) Real-Time Emotion on Mobile Devices. 13th IEEE Int. Conf. Automatic Face & Gesture Recognition (FG 2018), Xi'an, China, 15–19 May, pp. 787–787, IEEE.
2 Jain, N., Kumar, S., Kumar, A., Shamsolmoali, P. and Zareapoor, M. (2018) Hybrid deep neural networks for face emotion recognition. Pattern Recognit. Lett., 115, 101–106.
3 Lukose, S. and Upadhya, S.S. (2017) Music Player Based on Emotion Recognition of Voice Signals. Int. Conf. Intelligent Computing, Instrumentation and Control Technologies (ICICICT), Kannur, India, 6–7 July, pp. 1751–54, IEEE.
4 Ghanem, K. and Caplier, A. (2013) Towards a full emotional system. Behav. Inf. Technol., 32, 783–799.
5 Glowinski, D. and Camurri, A. (2008) Technique for Automatic Emotion Recognition by Body Gesture Analysis. Proc. IEEE Computer Society Conf. Computer Vision and Pattern Recognition, Workshop on Human Communicative Behavior Analysis, Anchorage, AK, USA, 23–28 June, pp. 1–6, IEEE.
6 Bustamante, P.A., Lopez Celani, N.M., Perez, M.E. and Quintero Montoya, O.L. (2015) Recognition and Regionalization of Emotions in the Arousal-Valence Plane. 37th Annual Int. Conf. IEEE Engineering in Medicine and Biology Society (EMBC), Milan, Italy, 25–29 Aug, pp. 6042–6045, IEEE.
7 Mariooryad, S. and Busso, C. (2016) Facial expression recognition in the presence of speech using blind lexical compensation. IEEE Trans. Affect. Comput., 7, 346–359.
8 Jenke, R., Peer, A. and Buss, M. (2014) Feature extraction and selection for emotion recognition from EEG. IEEE Trans. Affect. Comput., 5, 327–339.
9 Bajaj, V. and Pachori, R.B. (2015) Detection of human emotions using features based on the multiwavelet transform of EEG signals. In Hassanien, A.E. and Azar, A.T. (eds.) Brain-Computer Interfaces, Vol. 74, pp. 215–240. Springer.
10 Bajaj, V. and Pachori, R.B. (2014) Human Emotion Classification from EEG Signals Using Multiwavelet Transform. Int. Conf. Medical Biometrics, Shenzhen, China, 30 May–1 June, pp. 125–130, IEEE.
11 Krishna, A.H. et al. (2018) Emotion classification using EEG signals based on tunable-Q wavelet transform. IET Sci. Meas. Technol., 1–7. doi: 10.1049/iet-smt.2018.5237.
12 Bajaj, V., Taran, S. and Sengur, A. (2018) Emotion classification using flexible analytic wavelet transform for electroencephalogram signals. Health Inf. Sci. Syst., 6, 12.
13 Zhang, Y., Ji, X. and Zhang, S. (2016) An approach to EEG-based emotion recognition using combined feature extraction method. Neurosci. Lett., 633, 152–157.
14 Wu, C.-H. et al. (2013) Speaking effect removal on emotion recognition from facial expressions based on Eigenface conversion. IEEE Trans. Multimed., 15, 1732–1744.
15 Halder, A. et al. (2011) Emotion Recognition from Facial Expression using General Type-2 Fuzzy Set. Proc. Int. Conf. Soft Computing for Problem Solving, December 20–22, Springer.
16 Hakura, J., Domon, R. and Fujita, H. (2013) Emotion Recognition Method Using Facial Expressions and Situation. 12th Int. Conf. Intelligent Software Methodologies, Tools and Techniques (SoMeT), Budapest, Hungary, 22–24 September, IEEE.
17 Chang, C.-Y., Tsai, J.-S., Wang, C.-J. and Chung, P.-C. (2009) Emotion Recognition with Consideration of Facial Expression and Physiological Signals. IEEE Symp. Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), pp. 278–283, IEEE.
18 Abdat, F., Maaoui, C. and Pruski, A. (2011) Human-Computer Interaction Using Emotion Recognition from Facial Expression. UKSim 5th European Symposium on Computer Modeling and Simulation.
19 Huang, C.L. and Huang, Y.M. (1997) Facial expression recognition using model-based feature extraction and action parameters classification. J. Vis. Comm. Image Representation, 8, 278–290.
20 Lyons, M.J., Akamatsu, S., Kamachi, M. and Gyoba, J. (1998) Coding Facial Expressions with Gabor Wavelets. Proc. Int'l Conf. Automatic Face and Gesture Recognition, pp. 200–205.
21 Steffens, J., Elagin, E. and Neven, H. (1998) Person Spotter: Fast and Robust System for Human Detection, Tracking, and Recognition. Proc. Int'l Conf. Automatic Face and Gesture Recognition, pp. 516–521.
22 Wiskott, L. (1995) Labelled Graphs and Dynamic Link Matching for Face Recognition and Scene Analysis, Vol. 53. Verlag Harri Deutsch, Reihe Physik, Frankfurt am Main.
23 Pentland, A., Moghaddam, B. and Starner, T. (1994) View-Based and Modular Eigenspaces for Face Recognition. Proc. Computer Vision and Pattern Recognition, 84–91.
24 Khorrami, P., Paine, T.L., Brady, K., Dagli, C. and Huang, T.S. (2016) How Deep Neural Networks can Improve Emotion Recognition on Video Data. IEEE Conf. Image Processing (ICIP), pp. 619–623, IEEE.
25 Kato, M., So, I., Hishinuma, Y., Nakamura, O. and Minami, T. (1991) Description and Synthesis of Facial Expressions Based on Isodensity Maps. In Kunii, T. (ed.) Visual Computing, pp. 39–56. Springer-Verlag.
26 Kearney, G.D. and McKenzie, S. (1993) Machine interpretation of emotion: design of memory-based expert system for interpreting facial expressions in terms of signaled emotions (JANUS). Cogn. Sci., 17, 589–622.
27 Kobayashi, H. and Hara, F. (1992) Recognition of Six Basic Facial Expressions and Their Strength by Neural Network. Proc. Int'l Workshop Robot and Human Comm., 22, 381–386.
28 Kobayashi, H. and Hara, F. (1992) Recognition of Mixed Facial Expressions by Neural Network. Proc. Int'l Workshop Robot and Human Comm., pp. 387–391.
29 Pantic, M. and Rothkrantz, L.J.M. (2000) Expert system for automatic analysis of facial expression. Image Vision Comput. J., 18, 881–905.
30 Vincent, J.M., Myers, D.J. and Hutchinson, R.A. (1992) Image Feature Location in Multi-Resolution Images Using a Hierarchy of Multi-Layer Perceptrons. Neural Networks for Speech, Vision, and Natural Language, pp. 13–29. Chapman & Hall.
31 Kass, M., Witkin, A. and Terzopoulos, D. (1987) Snakes: Active Contour Models. Proc. Int'l Conf. Computer Vision, pp. 259–269.
32 Hara, F. and Kobayashi, H. (1997) State of the art in component development for interactive communication with humans. Adv. Rob., 11, 585–604.
33 Holt, R.J., Huang, T.S., Netravali, A.N. and Qian, R.J. (1997) Determining articulated motion from perspective views. Pattern Recognit., 30, 1435–1449.
34 Hong, H., Neven, H. and von der Malsburg, C. (1998) Online Facial Expression Recognition Based on Personalized Galleries. Proc. Int'l Conf. Automatic Face and Gesture Recognition, pp. 354–359.
35 Hu, H., Xu, M.X. and Wu, W. (2007) GMM Supervector Based SVM with Spectral Features for Speech Emotion Recognition. ICASSP, IEEE Int. Conf. Acoustics, Speech and Signal Processing, Vol. 4, pp. 413–416.
36 Zhang, L. (1996) Estimation of Eye and Mouth Corner Point Positions in a Knowledge-Based Coding System. Proc. SPIE Digital Compression Technologies and Systems for Video Communications.
37 Kotropoulos, C. and Pitas, I. (1997) Rule based face detection in frontal views. Speech Signal Process., 4, 2537–2540.
38 Kotropoulos, C., Tefas, A. and Pitas, I. (1998) Frontal Face Authentication Using Variants of Dynamic Link Matching Based on Mathematical Morphology. Proc. IEEE Int. Conf. Image Processing, pp. 122–126.
39 Rowley, H., Baluja, S. and Kanade, T. (1996) Human Face Detection in Visual Scenes. In Touretzky, D.S., Mozer, M.C. and Hasselmo, M.E. (eds.) Advances in Neural Information Processing Systems. Springer.
40 Rao, S., Pramod, N.C. and Paturu, C.K. (2008) People Detection in Image and Video Data. Proc. 1st ACM Workshop on Vision Networks for Behavior Analysis, Canada, pp. 85–92.
41 Afdhal, R., Bahar, A., Ejbali, R. and Zaied, M. (2015) Face Detection Using Beta Wavelet Filter and Cascade Classifier Entrained with Adaboost. 8th Int. Conf. Machine Vision (ICMV'8), November 19–21, Barcelona, Spain.
42 Afdhal, R., Ejbali, R. and Zaied, M. (2017) Emotion Recognition Using the Shapes of the Wrinkles. ICCIT 2016, pp. 191–195, Bangladesh.
43 Afdhal, R., Ejbali, R. and Zaied, M. (2017) A Hybrid System Based on Wrinkles Shapes and Biometric Distances for Emotion Recognition. ACHI, March, pp. 206–211, Nice, France.
44 Khatrouch, M., Gnouma, M., Ejbali, R. and Zaied, M. (2018) Deep Learning Architecture for Recognition of Abnormal Activities. ICMV, International Society for Optics and Photonics, p. 106960F, Vienna.
45 Zaied, M., Said, S., Jemai, O. and Ben Amar, C. (2011) A novel approach for face recognition based on fast learning algorithm and wavelet theory. Int. J. Wavelets Multiresolut. Inf. Process., 9, 923–945.
46 Zaied, M., Ben Amar, C. and Alimi, M.A. (2005) Beta wavelet networks for face recognition. J. Decis. Syst., 14, 109–122.
47 Afdhal, R., Ejbali, R., Zaied, M. and Amar, C.B. (2014) Emotion Recognition Using Features Distances Classified by Wavelets Network and Trained by Fast Wavelets Transform. Int. Conf. Hybrid Intelligent Systems, Kuwait, 14–16 December, pp. 238–241, IEEE.
48 Gnouma, M., Ejbali, R. and Zaied, M. (2018) Abnormal events' detection in crowded scenes. Multimedia Tools Appl., 77, 24843–24864.
49 Lanjewar, R.B., Mathurkar, S. and Patel, N. (2015) Implementation and comparison of speech emotion recognition system using Gaussian mixture model (GMM) and K-nearest neighbor (K-NN) techniques. Procedia Comput. Sci., 49, 50–57.
50 http://www.consortium.ri.cmu.edu/ckagree/
51 http://www.kasrl.org/jaffedb_info.html
52 Lyons, M.J., Akamatsu, S., Kamachi, M. and Gyoba, J. (1998) Coding Facial Expressions with Gabor Wavelets. Proc. Third IEEE Int. Conf. Automatic Face and Gesture Recognition, Nara, Japan, April 14–16, pp. 200–205, IEEE Computer Society.
53 Khorrami, P., Paine, T.L., Brady, K., Dagli, C. and Huang, T.S. (2016) How Deep Neural Networks can Improve Emotion Recognition on Video Data. IEEE Conf. Image Processing (ICIP).
54 Chernykh, V., Sterling, G. and Prihodko, P. (2017) Emotion Recognition From Speech With Recurrent Neural Networks. arXiv:1701.08071v1.
55 Fan, Y., Lu, X., Li, D. and Liu, Y. (2016) Video-Based Emotion Recognition using CNN-RNN and C3D Hybrid Networks. ICMI '16 Proc. 18th ACM Int. Conf. Multimodal Interaction, pp. 445–450.
© The British Computer Society 2019. All rights reserved.
TI - Emotion Recognition by a Hybrid System Based on the Features of Distances and the Shapes of the Wrinkles
JF - The Computer Journal
DO - 10.1093/comjnl/bxz032
DA - 2020-03-18
UR - https://www.deepdyve.com/lp/oxford-university-press/emotion-recognition-by-a-hybrid-system-based-on-the-features-of-c3bPzIeAeC
SP - 1
VL - Advance Article
IS -
DP - DeepDyve
ER -