Abstract

Using a deliberately slow computer-game interface to induce a state of hypothesised frustration in users, we collected physiological, video and behavioural data, and developed a strategy for coupling these data with real-world events. The effectiveness of our strategy was tested in a study with thirty-six subjects, where the system was shown to reliably synchronise and gather data for affect analysis. A pattern-recognition strategy known as Hidden Markov Models was applied to each subject's physiological signals of skin conductivity and blood volume pressure in an effort to see if regimes of likely frustration could be automatically discriminated from regimes when frustration was much less likely. This pattern-recognition approach performed significantly better than random guessing at classifying the two regimes. Mouse-clicking behaviour was also synchronised to frustration-eliciting events and analysed, revealing four distinct patterns of clicking responses. We provide recommendations and guidelines for using physiology as a dependent measure for HCI experiments, especially when considering human emotions in the HCI equation.

1 Introduction

Affective computing has been described as “computing that relates to, arises from, or deliberately influences emotions” (Picard, 1997). Why build an affective computer? At present, computer systems interact with users in ways that do not allow for the complexities of naturalistic social interaction. Yet, recent evidence demonstrates that humans have an inherent tendency to respond to media in ways that are natural and social, mirroring the ways that humans respond to one another in social situations (Reeves and Nass, 1996). Computer systems can build on this predisposition by offering users interaction enriched with affective understanding. Current systems are still impoverished in the options they have for both understanding communication from the user, and communicating to the user. A computer that could decode and produce affective responses has the potential for significant improvement in its interactive capabilities. This line of inquiry has widespread implications for HCI, ranging from better educational software to improved computer-mediated communication.

Already, affect synthesis and analysis are beginning to drive the ways that researchers think about and build interactive computer systems. Some have begun to build computer systems that can synthesise emotional states (by synthesis of emotional states, we mean the capability of some current computer systems to maintain a model of drives and behaviours, enabling rudimentary simulation of emotional states, primarily for the purpose of aiding human–computer interaction); others have focused on pattern recognition techniques to infer aspects of the user's emotional state from audio, video and other signal inputs. In each case, the design and construction of computational systems that attempt to infer characteristics of a user's emotional state must make sense of multiple and co-occurring situational variables. Before researchers can adequately address the minutiae in a given emotional exchange, we must understand how to accurately contextualise the user's response within his surroundings. Only then will we enable computers to interpret the emotional response as it is embedded within what the user is doing and experiencing. This paper is an initial attempt to address the emotional response and contextual information together.
1.1 The experiment

We designed an experiment that investigated the use of physiological sensors as input to infer the user's emotional state. Here, we wish to define what we mean by emotional state inference: the pattern analysis of signals, with the intent of the programmer to have such analysis produce a label or response characterising emotional or affective qualities of the inputs.

Subjects played a computer game where the goal was to complete a series of visual puzzles as quickly and as accurately as possible in order to win a cash reward. At repeated random intervals, the computer mouse was rigged to behave as though it were faulty (i.e. when the user clicked to move forward to the next puzzle, nothing occurred for several seconds). Our interest was in eliciting frustration, a multifaceted emotional state, hypothesised to occur during a multi-second window after each failed mouse-click. For the duration of the experiment, subjects wore two different physiological sensors on their bodies. Our dependent variables were multiple, and included these physiological measures as well as subjects' mouse-click behaviour and other relevant contextual information about the game. We analysed these variables throughout the periods following mouse failures, comparing them to episodes of normal activity (no mouse failures).

It should be mentioned at the outset that this paper places a primary focus on the methodological treatment of these issues, in addition to describing the outcome variables. We discovered the need to invent new protocols for managing multiple data inputs. This process was a learning experience, and in itself comprises a large percentage of this document. While we are pleased to report encouraging initial data and analyses, another of our main objectives is to describe what we learned during the process of collecting and making sense of several channels of data. The two concepts, the method of data collection and the results in recognising affective information from the data, are coupled, but it is important to remember that successful data synchronisation and collection does not imply successful affect pattern recognition. The latter is a notoriously difficult problem, highlighted by a longstanding debate in the emotion theory literature about whether or not emotions can even be differentiated by physical responses. Consequently, the results presented here go beyond describing a methodology for gathering data about emotional expression; they also begin to address a larger debate about which physical signals manifest differentiation with emotional state. One of our most significant contributions is the recommendation of a model of data gathering that can help HCI researchers explore the potential of using multiple sensing technologies. This model should be robust enough to work with various subsets of sensors, be they physiological, non-physiological, or a combination of both.

1.2 Background

1.2.1 Physiological sensing

Why did we choose to measure physiology as an index of affective state? Several other methods are available for giving a computer access to aspects of the emotional response. Video and audio analysis to recognise facial expressions, gestures, and voice are obvious options, especially since they are readily communicated over a distance. The physiological response tends to be harder to make sense of, and may require physical contact between user and computer system for sensing. Additionally, skin-surface sensing may at first seem undesirably obtrusive.
Most current-day physiological sensors have a rather clunky interface and dangling wires that can be bothersome. However, physiological sensing is gradually moving into devices that people are naturally in physical contact with. Although the sensors used in our initial experiments described below were standard medical sensors placed on the hand, these same sensors have also been built into jewellery, shoes, clothing, eyeglasses and a mouse (Ark et al., 1999; Marrin and Picard, 1998; Picard and Healey, 1997; Scheirer et al., 1999).

Further, it is interesting to consider some of the pros and cons of sensing with ‘highly public’ means such as cameras and microphones, versus with relatively ‘intimate’ means such as skin-surface sensors. Although the former involves no physical contact, and certainly provides an easy-to-understand means of communication, it can also be viewed as an invasion of privacy that is hard for the user to control. The user may want her emotion communicated but not want her appearance transmitted or her voice recorded. Furthermore, it may be hard for a single user to disable a computer vision or voice recognition system that is built into a ‘smart room’. By building physiological sensors into wearable systems, or embedding them in traditional input devices such as the mouse or keyboard, the user retains primary control. Of course, it is important to remember that the kind of information sensed by a physiological sensor is inherently different from that sensed by cameras or microphones, in that it is not usually public or under the user's control. That said, the user maintains the choice of physically removing or disabling the sensors easily and whenever he wants, and he can be assured that these signals do not provide identifying information, as would video face recognition or audio speaker-identification systems.

There is mounting evidence suggesting that physiological signals may have characteristic patterns for specific emotional states (e.g. Cacioppo and Tassinary, 1990; Ekman et al., 1983; Vyzas and Picard, 1998). However, emotion researchers still argue about the definition of emotion and what constitutes an emotional state, so it is still very hard to compare the results of efforts to recognise emotions from physiology. Many researchers eschew the use of categorical labels for emotional states and instead describe emotion by a set of two or more dimensions. The most common two dimensions for describing emotion are arousal (activation or excitement level) and valence (the positive or negative quality of the emotion) (Lang et al., 1993; Schlosberg, 1954). For example, both anger and fear belong to the ‘high arousal, negative valence’ category, happiness belongs to the ‘high arousal, positive valence’ category, and sadness belongs to the ‘low arousal, negative valence’ category. Physiological signals such as skin conductivity, heart rate, and muscle tension may provide key information regarding the intensity and quality of an individual's internal experience. These kinds of signals are easily digitised and may eventually be unobtrusively monitored, making them very accessible to pattern recognition techniques.
Although debate exists regarding the specificity of signals to particular emotional states, we suggest that psychophysiological data may at least provide information regarding the valence and arousal of the user's internal state, and may be helpful by acting in tandem with computer vision, hearing, and natural language processing to make computers more aware of user affect. Attention to methodological detail is necessary in order to address the complexity and high individual variability in physiological reaction to external and internal events.

Applied psychophysiological research explores the relationship between social/behavioural phenomena and physiological events and principles. Cacioppo and Tassinary (1990) explore the nature of psychophysiology–emotion relationships, considering several categories of connections: one-to-one (i.e. one physiological signal maps to one particular emotion), many-to-many, one-to-many, and many-to-one. In the case of our frustration experiment, we allowed for the many-to-one case, assuming that multiple features of a series of signals might provide the most information about an elicited reaction.

Our work builds on a history of the use of psychophysiological signals in HCI experimentation. Much of the research in this area has focused on mental workload rather than on components of the emotional response system (such as frustration), but is still relevant to our experiment. Many different sensors and signals have been explored by researchers during the past 20 years, including but not limited to: cortisol levels, heart rate variability, respiration, electrodermal activity, event-related brain potentials, electroencephalography (EEG), palmar sweat, and pupil diameter (Kiesler et al., 1985; Kramer, 1991; Wiethoff et al., 1991). The reader is referred to Wiethoff et al. (1991) or Kramer (1991) for excellent meta-reviews of the history of physiological metrics as variables in human–computer interaction studies, as well as a discussion of the pros and cons of each signal. Several researchers, Wastell (1990) in particular, have argued that psychophysiological measures are viable research methods for HCI. Psychophysiological measures add information that is special and unique, and the resulting information can help designers create better systems (Henning et al., 1995). Recently, some have begun to view psychophysiology as useful not only for the experimenter, but also for the computer system itself. Rowe et al. (1998) suggest that giving computers access to such signals should be a goal for HCI designers, so that the system may more easily adapt itself to the individual user.

It should be noted from the outset that there are both advantages and disadvantages to using psychophysiological measures in HCI research. Unfortunately, using physiological signals necessitates specialised equipment and the technical expertise to run it. Also, it can be quite difficult to separate out confounding factors that may be influencing the physiological reaction, in order to attribute significant changes to the experimental variable under investigation (Kramer, 1991). However, at least some sensors can be monitored without interfering with the experimental task, especially in the case of something like a heart-rate monitor, which is worn under the clothing and, once attached, is relatively unobtrusive to the subject.
Physiological variables can also give researchers insight into the short-term (occurring within seconds) shifts that occur during a task and that may not be measurable by any other means (Kramer, 1991). It is important, however, to recognise that physiological measurements alone are not adequate to give a coherent picture of what affective state is occurring within the user (Henning et al., 1995; Wastell, 1990; Wilson and Eggemeier, 1991). While some may argue that using psychophysiology is too reductionistic, such fears can be mitigated by considering the signals in complementary terms with behavioural and other variables, such as self-report instruments. Henning et al. (1995) echo this approach in their claim that psychophysiological measures are seldom meaningful at all unless analysed along with other measures. Our experiment followed such guidelines by examining a behavioural variable (mouse clicks) in tandem with psychophysiological signals. Lastly, it is highly recommended that researchers use multiple physiological measures rather than a single signal. This approach is advantageous because each measure has a unique sensitivity for certain aspects of the body's response system; combined, the analysis becomes stronger (Wilson and Eggemeier, 1991). We have chosen to implement this recommendation in our experiment as well, by considering multiple physiological signals together.

Two physiological signals were chosen for the current experiment, although we do not claim that the two we chose are optimal for measuring frustration. These two signals are galvanic skin response (GSR) and blood volume pressure (BVP). We will focus on these two measures in the rest of this paper for concreteness, but we stress that the methodological principles described here are independent of the specific signals measured.

GSR, also sometimes called skin conductivity or electrodermal response (Dawson et al., 1990), measures a phenomenon of human physiology in which the skin momentarily becomes a better conductor of electricity when either external or internal stimuli occur that are physiologically arousing. It has been closely linked to emotion and attention. It is measured by passing a small current through a pair of electrodes placed on the surface of the skin and measuring the conductivity level. Essentially, as the body becomes increasingly aroused, palmar sweat increases and the skin conducts the signal better; increased arousal thus potentiates the measured current. GSR is highly influenced by frustrative non-reward situations, and has often been used to measure subjects' reactions to a situation or discrete stimulus that elicits anxiety. GSR is also one of the signals used in the polygraph or ‘lie detector’ test.

BVP, also known as blood volume pulse or peripheral blood flow measurement (Papillo and Shapiro, 1990), uses the light absorption characteristics of blood to measure the blood flow through skin capillary beds in the finger (a technique known as photoplethysmography). Small capillaries such as these tend to contract upon the subject's experience of an anxiety-provoking stimulus, causing the envelope of the signal to ‘pinch’ inwards. This is often referred to as the “cold feet” phenomenon, in which blood tends to drain from one's extremities during periods of emotional duress. The periodic component of this signal can also provide heart rate which, if measured precisely enough, can be used to extract heart-rate variability, which may give clues to valence (Rowe et al., 1998).
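As a concrete illustration of that last point, the following minimal sketch (our own, not from the study; it assumes a digitised BVP series in a numpy array and a known sampling rate) shows one way to recover mean heart rate and a simple beat-to-beat variability measure from the periodic component of the BVP.

```python
import numpy as np
from scipy.signal import find_peaks

def heart_rate_stats(bvp, fs):
    """Mean heart rate (beats/min) and std of the inter-beat interval (s)."""
    # A refractory distance of 0.4 s caps detection at 150 beats/min,
    # which suppresses spurious secondary peaks within a single pulse.
    peaks, _ = find_peaks(bvp, distance=int(0.4 * fs))
    ibi = np.diff(peaks) / fs          # inter-beat intervals in seconds
    return 60.0 / ibi.mean(), ibi.std()
```

The variability term here (the standard deviation of the inter-beat interval) is the crudest of several possible heart-rate-variability measures; the features introduced in Section 3.1.1 take a related but time-resolved approach.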
1.2.2 Physiology, frustration, and human–computer interaction

While researchers have not yet discovered universal physiological profiles for individual emotion labels, one component of emotion, arousal, can be measured relatively robustly with the skin conductivity response. Arousal is a broad term referring to overall activation, or ‘emotional excitedness’, and is widely considered to be one of the two main dimensions of an emotional response. Measuring arousal is therefore not the same as measuring an emotional state such as frustration in its entirety, but is an important component of it. Additionally, the arousal component of frustration tends to be of negative valence. For the purposes of our experiment, we were essentially trying to characterise periods of increased negative arousal as correlated with mouse-click and screen events.

Frustration theory, studied in the psychology community since the 1930s, has been historically difficult to define. Since frustration was first described in the psychological literature during the rise of the behaviourists, much of the work on frustration has involved animal behaviour. Lawson describes Rosenzweig's theory of frustration (Lawson, 1965) as “the occurrence of an obstacle that prevented the satisfaction of a need”. Others have paired frustration with aggression (Lawson, 1965). In Lawson's formulation, the occurrence of frustration always increases the tendency for an organism to respond aggressively, i.e. a rat will increase its vigour when an obstacle is placed between it and its reward. Behaviourist principles state that one of the principal independent variables (causes) of frustration is the delayed reinforcement (reward) of a conditioned response (Amsel, 1992). In a traditional experimental design, this might be implemented as a delay in the delivery of food (reward) after a trained animal presses the correct lever (response). In our experiment, the lever pressing is analogous to clicking the mouse to advance the screen, and the delivery of food corresponds to screen advancement.

These concepts of reward and delay are also familiar to the HCI community as issues of immediate feedback and user control. Such concepts underlie user-interface guidelines that are long established in the field of HCI, and are part of what are known as principles of Direct Manipulation (Mayhew, 1992; Shneiderman, 1986). While our experiment was not exactly analogous to the rat scenario (i.e. our ‘reward’ of screen advancement is actually a step towards a more substantial reward, the cash prize), we still set the user up to be frustrated by prohibiting him from gradual progress toward the goal. We hypothesised, therefore, that if we introduced a delay like this in the game's response to the user's actions, the result would be similar to the animal's frustration response. While we cannot directly measure the complexity of any resulting frustration, we can measure the arousal component of this response, as described above, together with changes in blood flow and in behavioural variables, such as increased mouse-clicking directly after the delay event. Our experimental design purposefully exploits the violation of these feedback and user-control guidelines. In a companion paper (Klein et al., 2002), it was verified that inserting unwanted delays into the user's task led to significantly more frustration in users compared to a control group performing the same task without the delays.
If it is true that users consistently achieve a state of high arousal and negative valence in direct, repeated response to such flouted rules of immediate feedback and control, an added value of work such as ours is to provide yet further confirmation that these design guidelines are valid and necessary. In the future, when researchers have been more successful in correlating physiological response with specific affective states, we expect that the categorisation will go from the broad classification of arousal that we attempted in this experiment, to the sensing of more subtle states, such as frustration, interest, and confusion.

2 The pilot study

This study was executed with prior approval of MIT's Committee on the Use of Human Experimental Subjects, in accordance with their ethical guidelines of privacy, deception, and subject rights.

2.1 Subjects

Thirty-six undergraduate and graduate students participated in this experiment, recruited through flyers posted in and around the MIT campus. They were told that the experiment would last for one hour and that they would receive US$10 for their participation. Subjects were led to believe that their task would be ‘participation in a visual cognition game’, a believable story, given that the experiment took place in the Vision and Modelling group at the MIT Media Laboratory. If subjects had been told up front that the goal was to try to frustrate them, then most of them probably would not have become frustrated. Consequently, it was necessary to initially deceive the subjects in order to elicit the desired emotional reaction in ways that closely resembled a real-life situation. All subjects were debriefed afterwards as to the true nature of the experiment, and reminded of their right to have their data withdrawn if they wished.

2.2 Materials

2.2.1 Psychophysiology sensing system

The sensing system consisted of GSR and BVP sensors attached to the first three fingers of the subject's non-dominant hand (Fig. 1). Subjects used their dominant hand for the mouse. Several subjects also wore an electromyograph (EMG, or muscle-activation) sensor on the same-side shoulder; these data were not included in the present analysis because not all subjects underwent that part of the experiment.

Fig. 1. Detail of biosensor placement on the subject's non-dominant hand.

The sensors attached via wires to a ProComp Plus analog-to-digital unit. The ProComp Plus (Thought Technology, http://www.thoughttechnology.com) is a multimodality, 8-channel, medically approved, safe system for monitoring biosignals, which converts the analog signals into digital form. The ProComp unit was connected through a fibre-optic cable and adapter to a Toshiba 110CS Satellite laptop PC with a 10-inch colour display that was hidden from the subject's view, although in the same room. The laptop silently recorded the signals from the ProComp Plus unit at 20 samples/s, using software designed by Thought Technology and running under DOS.

2.2.2 Game system hardware and software

The game system (see Fig. 2) consisted of a Power Macintosh 8500/180 with one large 21-inch colour monitor that displayed the experimental game, and a second 13-inch colour monitor that displayed a large (124 pt.) digital clock. The rigged game system described before ran on this Macintosh system.

Fig. 2. Experimental set-up.
We designed, built and tested an interactive software game specifically for this experiment using Macromedia Director 5.0 for the Macintosh. The system development underwent six iterations of design, prototyping, user testing and redesign, over a six-week period. The game consisted of a series of 40 similar visual puzzles (Fig. 3), each on a separate screen in modal succession.

Fig. 3. A typical puzzle with clock showing elapsed time.

2.2.3 Other equipment

A video camera recorded the subject's upper torso and hands, as well as the elapsed time of the experiment on the smaller monitor, which faced the camera.

2.3 Pilot study procedures

After responding to the flyers, subjects were scheduled for a one-hour time slot. They were then told the ‘cover story’: that the purpose of the experiment was our interest in how their physiology would react to a series of brightly coloured graphics as they interacted with the game. After subjects arrived at the lab, they were asked to read and sign MIT's standard subject's rights forms, and then were ushered into a conference room where the experiment took place. They were then given the game instructions. The game consisted of a series of puzzles, and the task was to click the mouse on the correct box at the bottom of the screen corresponding to the item of which there were ‘the most’ in the array above. This mouse-click also advanced the screen to the next puzzle. Subjects received $10 for their participation, but the game was also a competition; the individual who achieved the best overall score and speed at the end of the data collection was told s/he would receive a one-hundred-dollar prize. This incentive was intended to mimic a real-life situation where users would be racing toward a goal. At irregular intervals, a delay occurred during which the mouse appeared not to work properly. If questioned, the experimenter nonchalantly answered, “Oh, it sticks sometimes. Please keep going”.

2.4 Design results

The experiment ran smoothly on 36 subjects, successfully producing tightly synchronised streams of mouse-click behaviour, video, physiological signals, and events in the “game”. All subjects were asked, prior to the debriefing, whether they suspected the true nature of the experiment. Of the 36 subjects, seven suspected the deception. One of these subjects was very suspicious and actually resented being deceived; this subject's data was excluded from the analysis. Of the remaining six subjects, one did not have sufficient data to be used (the protocol for determining this will be explained in Section 3), and the other five subjects were included in the study (see Fig. 8 for a cross-comparison of these five subjects with the rest of the subject population). In summary, the design methodology described before was embodied in a working system and was found to be successful for eliciting two episodes: (1) ‘all is going smoothly’, and (2) ‘the system is impeding the user's goal’. We now turn to the pattern recognition section of this paper, which examines whether the physiology and behaviour of the user showed any distinctive differences during these two episodes.

3 Pattern recognition

Data analysis of human physiology and behaviour is a complex problem. Several factors, both external and internal, shape the output of the sensors.
The goal here was to use physiological data to see whether the computer could be taught to identify and discriminate differences between how a user responded when ‘all was going smoothly’ vs. how a user responded when ‘the system was not working properly’. We also analysed the behavioural data from the mouse for cues to the different patterns people use in responding to perceived system delays. The video data was not used in the pattern recognition analysis below, but is available for future work.

3.1 Physiological data modelling

In choosing a model that adequately captures the behaviour of the physiological signals, we need to consider the dynamic, time-evolving nature of the signals. Also, in order to make these models robust to variations, we should consider probabilistic models. One of the most successful techniques in the pattern recognition literature is that of Hidden Markov Models (HMMs). HMMs have been successfully used to model time series such as speech, and are currently used in both speaker-dependent and speaker-independent speech recognition systems. In Section 3.1.1, we highlight some of the issues involved in hidden Markov modelling for the benefit of the reader interested in the details of the computational models applied to the problem of classifying physiological signals. This section assumes that the reader is familiar with the basic theory of hidden Markov models, and is included here for the reader who may be interested in applying these techniques to similar problems or in reproducing similar results. The reader not interested in such details is encouraged to bypass Section 3.1.1 and resume reading in Section 3.2.

3.1.1 Hidden Markov models

We can think of a HMM as a machine that has a particular structure, determined by the number of states, how these are connected, and the types of output distributions associated with each, together with a set of parameters associated with that structure, for instance, the parameters of a distribution and the transition probabilities. When the structure is fixed, there exist efficient algorithms that allow us to estimate the parameters. Finding an optimal structure, however, is a difficult problem, so we have opted for a simple approach, namely, to select a subset of structures, train a model for each, and then evaluate the performance of each one for each subject in the data set. Due to the continuous nature of the physiological observations, we use continuous HMMs and model the output distributions as mixtures of Gaussian probability distributions. We then define a subset of HMM structures by varying the number of states (between 4 and 7); the number of Gaussians in the mixtures of the densities (1 or 2); the form of their covariance matrices (diagonal or full); and the topology of the HMM (left-to-right or fully connected). All of these can be considered free parameters that need to be fixed a priori before applying the learning algorithm. Since the objective is to find a possibly user-dependent structure, we have to treat the 32 possible combinations that result from varying the parameters above for each one of the subjects. We used the standard Baum–Welch algorithm for estimating the HMM parameters for each HMM configuration for each subject (Rabiner and Juang, 1986).
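To make the structure sweep concrete, here is a minimal sketch of how such a per-subject configuration search might look today. It is our illustration, not the original implementation: it assumes the third-party hmmlearn library, whose GMMHMM covers the Gaussian-mixture output distributions described above, and it approximates the left-to-right topology by zero-initialising the disallowed transitions (Baum–Welch re-estimation preserves zero probabilities).

```python
import itertools
import numpy as np
from hmmlearn.hmm import GMMHMM

def left_to_right_transmat(n_states):
    # Each state may stay put or move one state forward; zero entries
    # remain zero under Baum-Welch re-estimation.
    A = np.zeros((n_states, n_states))
    for i in range(n_states - 1):
        A[i, i] = A[i, i + 1] = 0.5
    A[-1, -1] = 1.0
    return A

def train_configurations(X, lengths):
    # X: stacked (n_samples, 5) feature vectors; lengths: per-sequence lengths.
    models = {}
    grid = itertools.product(range(4, 8), (1, 2), ('diag', 'full'),
                             ('ergodic', 'left-to-right'))
    for n_states, n_mix, cov, topo in grid:      # 4 x 2 x 2 x 2 = 32 structures
        hmm = GMMHMM(n_components=n_states, n_mix=n_mix,
                     covariance_type=cov, n_iter=50)
        if topo == 'left-to-right':
            hmm.init_params = 'mcw'              # keep our startprob/transmat init
            hmm.startprob_ = np.eye(n_states)[0] # always start in the first state
            hmm.transmat_ = left_to_right_transmat(n_states)
        hmm.fit(X, lengths)                      # Baum-Welch parameter estimation
        models[(n_states, n_mix, cov, topo)] = hmm
    return models
```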
One of the most important issues is how to obtain, from the raw data, a set of features that might have correlates with internal affective states. This is still an open research question: the mappings between affective and physiological states are being investigated at large in the psychophysiology community. In deciding on a feature set, we should account for classical measures while bearing in mind that we can also allow the models we are using to exploit more complex dynamic patterns that might not have received much attention in other studies. We have proposed the following set of five features computed from the raw data (for details see Fernandez (1997)):

(i) the GSR signal detrended by subtracting a time-varying sample mean (found with a moving 10 s window);
(ii) a local time-varying unbiased sample variance of the signal in (i) (found with a moving 10 s window);
(iii) the ‘pinch’ of the BVP, or difference between the upper and lower envelopes of the BVP signal (the upper and lower envelopes are the signals that bound the BVP, shown in Fig. 4 with dashed lines);
(iv) the variation (first difference) of the peak-to-peak interval of the BVP signal;
(v) the local variance of the detail coefficients in a 3-level wavelet expansion of the BVP signal.

Fig. 4. Example of BVP signal. The upper and lower envelopes are shown with dashed lines. For the segment of BVP shown, the pinch is smaller on the left section of the signal.

The GSR signal varies its baseline unpredictably across an experimental session. For this reason, the GSR features extracted in (i) and (ii) remove this time variation and examine the local amplitude and variance of the signal. Feature (iii) from the BVP signal captures how much the amplitude of the signal constricts or expands. Feature (iv) provides an approximation to heart rate variability. Finally, feature (v) provides a different measure of frequency variation over time by doing a wavelet expansion of the signal and analysing the local variance of the wavelet coefficients (over a 1.5 s window). Because the time series obtained in (iv) and (v) are sparser than the original time series, these values have been interpolated to obtain time series of equal length, which we can then stack in a five-dimensional feature vector.
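The sketch below shows one plausible implementation of these five features. It is our own reconstruction under stated assumptions, not the authors' code: 20 Hz signals as numpy arrays, SciPy peak detection for the BVP envelopes and beats, and PyWavelets with an arbitrarily chosen db4 wavelet for feature (v).

```python
import numpy as np
from scipy.signal import find_peaks
import pywt

FS = 20  # ProComp sampling rate, samples per second

def moving_stats(x, win):
    """Sliding-window mean and unbiased variance, same length as x."""
    pad = win // 2
    xp = np.pad(x, (pad, win - 1 - pad), mode='edge')
    w = np.lib.stride_tricks.sliding_window_view(xp, win)
    return w.mean(axis=1), w.var(axis=1, ddof=1)

def feature_matrix(gsr, bvp):
    t = np.arange(len(bvp))
    # (i) GSR detrended by a moving 10 s mean, (ii) its local variance
    mean_g, _ = moving_stats(gsr, 10 * FS)
    f1 = gsr - mean_g
    _, f2 = moving_stats(f1, 10 * FS)
    # (iii) pinch: difference between upper and lower BVP envelopes
    peaks, _ = find_peaks(bvp)
    troughs, _ = find_peaks(-bvp)
    f3 = np.interp(t, peaks, bvp[peaks]) - np.interp(t, troughs, bvp[troughs])
    # (iv) first difference of the peak-to-peak (inter-beat) interval,
    # interpolated back to a 20 Hz time series
    ibi = np.diff(peaks) / FS
    f4 = np.interp(t, peaks[2:], np.diff(ibi))
    # (v) local variance of level-3 wavelet detail coefficients (~1.5 s window);
    # the coefficients run at roughly FS/8, hence the stretch factor of 8
    d3 = pywt.wavedec(bvp, 'db4', level=3)[1]
    _, vd = moving_stats(d3, max(2, int(1.5 * FS / 8)))
    f5 = np.interp(t, np.arange(len(d3)) * 8, vd)
    return np.column_stack([f1, f2, f3, f4, f5])  # five-dimensional vectors
```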
3.2 Establishing a ground truth

We wish to treat the data analysis as a classification problem and determine whether we can characterise and predict possible instances of frustration from a set of observed physiological readings. Before proceeding, a ground truth needs to be established, to which the classifications will be compared. A “ground truth” is parlance for what the model should assume is the correct answer for the classification problem; it is the set of ideal labels assigned to the data used to learn the parameters of a model. Establishing an appropriate ground truth is of paramount importance because these labels determine the subsets of data used to learn the parameters corresponding to the model associated with each class. Furthermore, the ground truth, once established, becomes the ‘gold standard’ against which we compare the results of a classification (the output of the system after it has learnt to distinguish between classes); it is by this standard that we assess the performance of the system in properly recognising the labels.

Assigning these labels or categories to the data is a non-trivial problem, which deserves careful consideration, since the class categorisations we shall use to label the data have only been induced, not firmly established. In other words, there is uncertainty associated with the class to which the data belongs. There is, for instance, a possibility that a stimulus failed to induce a frustration response, and conversely, that a subject showed a frustration response in the absence of the controlled stimulus due to another, uncontrolled stimulus, such as a cognitive event. Using the convention of labelling the portions of the data in which the subject experienced frustration as F and those in which he did not as NF, the previous statement amounts to saying that a label of F may be assigned to NF data and vice versa. We will assign the labels F and NF to the data based on the heuristic principle that an F label should occur shortly after a frustrating stimulus took place, and an NF label when it did not.

Alternatively, one might consider interviewing the subject to gather self-report on the possible occurrence of these events, instead of relying on the presence or absence of a frustrating stimulus. However, it is not always possible to stop and ask the subject for confirmation at each instant, as that would disturb the experiment. Moreover, self-report data on emotions is notoriously variable, depending on many factors unrelated to the subject's emotion. Consequently, we cannot claim that the two episodes we distinguish truly correspond to frustration and to non-frustration; all we can say is whether the game was proceeding smoothly or not, and whether a difference showed up in the person's physiology, as detected by the applied models.

In the classical recognition problem, a set of data is used for learning the properties of the model under the different classes to be recognised. The classification of this training data is usually fixed, and this knowledge is then used to characterise the probabilistic properties of each separate class. We do not wish to abandon this framework, and will adopt a deterministic rule to label the training examples. However, establishing a proper labelling for the training data is one of the aspects of this problem that should be adaptive and subject to further discussion. Our primary belief about which class the data belongs to is given by the onset of the controlled stimuli that gave the appearance of the mouse breaking. An intuitive approach to defining the classes is to consider the response following each stimulus as representative of a frustration episode. How we establish the timing deserves some attention. The time window used to capture this response has to be wide enough to allow for the latency period that naturally precedes the physiological response; for GSR, this delay can be as much as 3 s (Helander, 1978). Fig. 5 illustrates the principle used to label the data portion between any two stimuli.

Fig. 5. Ground truth labelling.

This figure shows a portion of a GSR signal between two stimuli for a particular subject (instances when the mouse appeared to fail), represented by the bold vertical bars (t_i and t_{i+1}). Following the onset of one stimulus, we allow a dormant period of 1 s to pass before assigning the labels; then we window the following 10 s of data as representative samples of the class we want to model as frustration (F).
Since the boundaries that define the frustration episode are not known with precision, we allow another dormant period (of 5 s) to pass without any classification, and then consider the rest of the signal, up until the next stimulus, to correspond to the class of non-frustration (NF). If the remaining set of samples is shorter than a minimum length (3 s in these simulations), then no label is assigned to this region. If a person progressed so rapidly that the time windows used on two adjacent stimuli overlapped (i.e. the stimuli were spaced less than 11 s apart), then the two resulting segments of data labelled F are merged together. The chosen labels F and NF may be viewed as positive and negative examples of the phenomenon we want to model. The reader should bear in mind that this is a simplified mnemonic and modelling device, and not an argument for what the true state of the person's emotion is, since human physiology exhibits complex variation and comprises only a part of emotional experience. The regions labelled F or NF roughly correspond to where we have a higher degree of confidence about the class induced, relative to the unlabelled ‘don't-care’ regions.
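Gathered into code, the complete labelling rule of this section might look like the following sketch (our illustration, assuming 20 Hz sampling and stimulus onsets given as sample indices):

```python
import numpy as np

FS = 20                       # samples per second
F, NF, DONT_CARE = 1, 0, -1   # integer codes for the three regions

def label_session(n_samples, stimulus_onsets):
    """stimulus_onsets: sorted sample indices of the apparent mouse failures."""
    labels = np.full(n_samples, DONT_CARE)
    bounds = list(stimulus_onsets) + [n_samples]
    for t, t_next in zip(bounds[:-1], bounds[1:]):
        f_start, f_end = t + 1 * FS, t + 11 * FS   # 1 s dormant, then 10 s of F
        labels[f_start:min(f_end, n_samples)] = F  # overlapping F windows merge
        nf_start = f_end + 5 * FS                  # 5 s unlabelled guard period
        if t_next - nf_start >= 3 * FS:            # minimum NF length of 3 s
            labels[nf_start:t_next] = NF
    return labels
```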
3.3 Evaluation and discussion

The results reported below apply to the 24 subjects who had sufficient experimental data (2 or 3 sessions in total). On three occasions, a subject's session had to be discarded due to technical difficulties (e.g. a sensor fell off, or a signal was not recorded properly); in such cases, if the subject had additional sessions, the remaining data were still used. It was found that 12 subjects had only one session, which was not enough data to both train and test the recognition system. There was also one subject who chose to withdraw his data; this subject was not included in the pool of 36 subjects described here.

We divided the experimental sessions for each subject into a training set and a testing set. For the 11 subjects with two sessions, one session was randomly selected for training and the other for testing. For the remaining 13 subjects, who had three sessions, the testing session was selected randomly as the second or third session, and the remaining two sessions were used for training. After training each HMM structure for each of the 24 subjects, the training and testing data were parsed (segmented into regimes labelled F or NF) using Viterbi decoding. (Viterbi decoding is a standard algorithm used in HMM modelling to segment a time series according to different models (Rabiner and Juang, 1986); these models correspond in this case to the F and NF categories discussed earlier.) To evaluate the performance of the system, we calculated the percentage of data samples that had been correctly classified (this evaluation criterion, of course, only applies to labelled samples; the don't-care regions are left out of the evaluation), and the HMM that performed best on the testing set for each subject was chosen.

The percentage of properly classified data samples was used to measure the recognition performance. The performance to beat was that of a random classifier which outputs a decision (F or NF) on every data point with equal chance (random guessing, therefore, scores 50%). Performance was evaluated on the training set (the set of data used to train the system and learn the parameters of the model) as well as on a testing set (a set of data unseen by the system). This distinction is necessary because a system will typically exhibit a bias to perform best on the set of examples from which it has learnt; a true description of its performance requires an evaluation of how well it is able to generalise to a previously unseen data set.

In order to assess the significance of the results, we compared the error rate produced by the classifiers against that of a random classifier, which has an expected error rate of 50%. At a 95% confidence level (p<0.05), the overall (F and NF combined) performance on the training set was significantly better than random for all 24 subjects (the mean recognition rate was 81.87%). For the testing set, overall performance was significantly better than random for 21 of the 24 subjects (the mean overall recognition rate was 67.40%, and 71.85% for the 21 subjects who achieved rates better than random).
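The paper does not state which significance test was used; one plausible reconstruction (our assumption) is a one-sided binomial test of the per-sample accuracy against the 50% chance level, as sketched below. Note that consecutive samples are highly correlated, so treating them as independent trials, as this does, overstates significance.

```python
from scipy.stats import binomtest

def beats_chance(n_correct, n_labelled, alpha=0.05):
    """One-sided test of H0: per-sample accuracy equals 0.5 (chance)."""
    return binomtest(n_correct, n_labelled, p=0.5,
                     alternative='greater').pvalue < alpha

# e.g. 6740 of 10,000 labelled samples correct is clearly above chance
print(beats_chance(6740, 10_000))   # -> True
```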
The histograms show the distribution of the overall recognition rates for all subjects (Fig. 6), as well as the distribution of the recognition rates for the individual categories, F and NF (Fig. 7). The height of each bar is proportional to the number of subjects for whom the system attained the accuracy shown on the horizontal axis. A disparity can be seen between the recognition rates for the F and NF classes in Fig. 7; this may reflect the uncertainty we have in the ground truth of these data. The histograms also show that performance is subject dependent.

Fig. 6. Histogram of overall recognition rates (training and testing sets).

Fig. 7. Histogram of recognition rates for F and NF labels (training and testing sets).

It should be noted, however, that a fairer assessment of the performance of a system of this kind would take into account prior knowledge about the likelihood of occurrence and duration of each label, which is likely to change as a function of personality, time pressure, etc. Because of the nature of the experiment, each subject spent a variable amount of time on each experimental session. However, in the ground truth, the duration of each frustration episode was held constant, in accordance with the labelling rules described above. Consequently, the number of frustration episodes and the time spent in each could vary across subjects. This perhaps suggests designing alternative ground truth labellings for future re-modelling work in this area, taking into account the length of time that each subject invested in the experiment and adapting the length of the frustration episodes accordingly.

It might be interesting to compare the system's performance on the subset of subjects who claimed to have suspected something about the procedure with its performance on those who were properly deceived. We have separately plotted histograms showing the system's performance in both cases in Fig. 8. Since only five of the 24 subjects whose data we analysed claimed to have suspected the deception, it is difficult to draw any solid conclusions as to what extent a priori knowledge of the intention of the experiment affects the performance of the system. However, an interesting pattern emerges when we compare the system's performance for suspecting and deceived subjects separately. The average performance of the system on the training set was lower for suspecting subjects (79.22%) than for deceived subjects (82.57%). This pattern was also consistent on the testing set (58.61% for suspecting subjects versus 69.71% for deceived subjects). This observation may lend support to the validity of the deception scheme used in this experiment, as well as to the ground truth used to label the data: if the system consistently performs better on those subjects who were deceived, then it may be argued that the deception procedure successfully elicited different responses in them, and that our labelling rules are accurately targeting those segments of the data where this response is reflected.

Fig. 8. Histogram of overall recognition rates contrasting deceived subjects with undeceived subjects (training and testing sets).

The analysis presented here suggests that a valid induction procedure has been used to elicit responses during episodes of frustration and non-frustration that are categorically different. For 21 of the 24 subjects studied, these categories have been successfully modelled with computer learning algorithms that are able to learn the difference between the patterns of each category from a set of training data, and to apply this knowledge to classifying the categories in a testing set of data previously unseen by the system. The algorithm's performance significantly exceeds that of a system that predicts the categories at random. Improvements to the performance figures presented here can be imagined if we refine the ground-truth labelling rules to overcome the inherent uncertainty in our knowledge of what states the users were in during the experimental procedure.

3.4 Characterising mouse-clicking behaviour

The methodology used in this experiment's design also allowed us to look at a behavioural variable. We examined the mouse-clicking behaviour of the user during each episode where the mouse appeared to fail. Specifically, we computed the number of mouse-clicks following each such stimulus, and fit distributions to these data. We expected that some subjects would be very ‘passive’, showing few or no extra clicks, whereas other subjects would show a large number of clicks in response to the delay stimuli. We clustered the data sets of click behaviour obtained from the 24 subjects to examine whether similar patterns of behaviour could be found among the users. Assuming an underlying Poisson counting process governing each cluster, we constructed clusterings of K=3 to 5, using an iterative K-means algorithm. Using this approach, we obtained four distinct clusters for the entire data set. The Poisson distributions, their mean values, and the number of subjects who fit each cluster are shown in Fig. 9. The horizontal axis represents the number of clicks, and the vertical axis represents the probability of that number of clicks being made by a user in that cluster.

Fig. 9. Poisson distributions for each cluster, illustrating four distinct patterns of mouse-clicking behaviour when the system appeared to be stuck.
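A minimal sketch of such a clustering step follows (ours, not the authors' code): each cluster is summarised by a Poisson rate, subjects are assigned to the rate that maximises the likelihood of their click counts, and rates are re-estimated from the assigned counts, iterating K-means-style. The initial rates here are illustrative choices.

```python
import numpy as np
from scipy.stats import poisson

def poisson_kmeans(counts_per_subject, k, n_iter=100, seed=0):
    """counts_per_subject: list of 1-D int arrays, clicks after each stimulus."""
    rng = np.random.default_rng(seed)
    lam = rng.uniform(0.5, 5.0, size=k)           # initial Poisson rates
    for _ in range(n_iter):
        # assign each subject to the rate maximising the total log-likelihood
        assign = [int(np.argmax([poisson.logpmf(c, l).sum() for l in lam]))
                  for c in counts_per_subject]
        # re-estimate each cluster's rate as the mean of its members' counts
        for j in range(k):
            member = [c for c, a in zip(counts_per_subject, assign) if a == j]
            if member:                            # leave empty clusters unchanged
                lam[j] = np.concatenate(member).mean()
    return lam, assign
```

Running this for K = 3, 4 and 5 and inspecting the resulting clusters mirrors the procedure described above, which settled on four clusters.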
3.5 Discussion

The Poisson distributions reveal four different types of behavioural response to the stimuli. The upper-left panel, for instance, indicates a type of person who usually just waited without clicking, occasionally clicked one extra time, and rarely clicked more than that. As we move to the upper-right and lower-left panels, we see this behaviour shifting to a higher number of clicks. Finally, the lower-right panel represents a cluster of users who always made superfluous clicks, usually many of them.

The results obtained from the behavioural measure suggest that for 19 of the 24 subjects (i.e. all but the subjects in the first cluster), superfluous clicking was a natural response. An automated affect-recognition system that uses such behaviour in trying to recognise episodes of frustration would require, however, not only that the system be able to link such behaviour to frustration for that user, but also that the system have precisely timed awareness of its own behaviour, a kind of rudimentary ‘self-awareness’, so that it can detect events such as delays followed by ‘catapulting forward’. Indeed, a variety of typed patterns, e.g. repeating erroneous commands, can provide clues to a user's affective state. The current system could be augmented to measure other forms of physical interaction, including the pressure and direction of pressure exerted by the user on the mouse. The mouse-clicking patterns discovered here are just one of many possible characteristic behaviours for further exploration.

3.6 Conclusions

This section on pattern recognition has dealt with particular methods of statistical modelling that we can apply to the problem of recognising frustration episodes in human–machine interaction. We have presented the analysis of two kinds of user data that can be relevant to the analysis and diagnosis of the quality of the interaction. In Sections 3.1–3.3, we discussed the analysis of physiological data gathered during qualitatively different interactions with a computer interface. We discussed the issues relevant to obtaining a reliable annotation of the data in terms of these qualitative labels (establishing a ground truth), and used these labels to build models that are able to learn the physiological patterns associated with each label and to generalise these results to further data. In order to cope with the uncertainty in the mappings between internal states and patterns in the physiological data, we adopted a statistical modelling framework and implemented one statistical technique for modelling time series (hidden Markov models). We have shown that this approach achieves our goal of learning and predicting the frustration categories from the data with an accuracy exceeding 50% (random guessing) for 21 of the 24 subjects considered.

We have also analysed a methodology-specific behavioural measure from the users in Sections 3.4 and 3.5, namely the number of mouse-clicks a user entered when faced with a non-responsive interface. We have shown how different users follow behavioural patterns conforming to statistically different distributions, and have suggested that such measures may be incorporated into the sensing system to augment the number of input channels through which a user can convey a state of frustration.

4 Methodological recommendations

This section details the methodological issues we encountered in the process of creating this experiment to elicit frustration.
We describe experiment-specific solutions as well as a recommended general principle for each design point. One might think that it is easy to build a system that frustrates users. However, we found that it was quite difficult to build a system that frustrates users in a way that is reliable, repeatable, controllable, and characteristic over a series of individuals. In order to create stimuli that effectively elicited an emotional response of likely frustration in the user, we looked at a number of possible scenarios, but quickly settled on flouting several established user-interface design guidelines described by Mayhew (1992). Specifically, we built a system that impeded the user's goal of scoring well in a time-limited visual perception “game”, by causing unprovoked delays of seemingly random duration at seemingly random points during play.

4.1 Supporting the deception

We encountered several instances during the building of the interface for this experiment where it was necessary to alter standard methods for user feedback, often in counterintuitive ways. For example, Macromedia Director 5.0 features a GUI builder that offers easy-to-create widgets with built-in visual feedback. In particular, a button-builder yields a button that, when clicked, provides immediate reverse-flashing of the button. However, this experiment required this immediate-feedback feature to be disabled: if the buttons continued to provide reverse-flashing upon release of the mouse button, users might not believe the deception that the mouse/system was malfunctioning. Since we wished to otherwise support direct manipulation in the interface, we chose to change the immediate feedback on button clicks from this standard flashing to simply showing the next puzzle.

Recommendation. Eliciting emotional responses in the laboratory often involves deception. Interface design for experiments should support this goal, although it may include the reversal of established HCI guidelines, such as removing standard feedback mechanisms (in this case, reverse-flashing buttons).

4.2 Adding delays to manage delays

When we first built the frustration-eliciting system, we began with a simple puzzle game, in which a user could click on one of four ‘solution’ buttons at the bottom of the screen to indicate the user's solution to the puzzle, and the next puzzle to solve would instantly appear (Fig. 3). Since tallies were added up at the end of the game, users did not know whether their answers were correct. We then installed into the game what seemed, to users, to be random system delays lasting from 2 to 4 s, such that clicking one of the four solution buttons would not advance the screen. When we beta-tested the system on representative users (college and graduate students who did not know the system was rigged to pause), we found that testers were at a loss to account for the system failure, and often responded by repeated, rapid-fire clicking of the mouse on the same solution button, and sometimes on the other buttons. Since the normal operation of the interface involved immediate advancement to the next puzzle upon clicking one of the four solution buttons, rapid-fire clicking proved disastrous. Once the pre-programmed delay ended, this rapid-fire clicking would catapult users unintentionally (but irrevocably) past several subsequent puzzles, until the user realised s/he had regained control of the system. We did not want users to skip puzzles inadvertently, since that would skew many critical aspects of the experiment.
We found that all beta-testers recovered from their rapid-fire clicking within 600 ms of regaining apparent control of the interface (i.e. of the next puzzle advancing as usual). We therefore implemented a one-second ‘fail-safe’ delay on the puzzle that immediately followed each puzzle containing a freeze delay. Since users invariably took over a second to complete each puzzle and move on to the next, this ‘echo’ delay had the effect of mitigating the rapid-fire catapult behaviour, while remaining invisible to the user. Subsequent user testing revealed that this fix was completely effective.

Recommendation. Observe natural user interaction in iterative testing of the system before using it in the study. Be sure to note user behaviours such as body language and facial expression. Note that some user behaviours, including those that may have emotionally charged components (such as repeated clicking of the mouse in apparent frustration), may require complex and counterintuitive redesigning of the system in order to compensate for them, or to elicit the desired emotional reactions.
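In code, the fail-safe can be read as a simple debounce: the puzzle that follows a freeze ignores clicks for its first second on screen. The sketch below is our Python illustration of the logic under that interpretation; the original was implemented in Macromedia Director's Lingo, not in Python.

```python
import time

class PuzzleScreen:
    """One puzzle; follows_freeze marks the puzzle right after a scripted delay."""
    def __init__(self, follows_freeze):
        self.opened_at = time.monotonic()
        self.follows_freeze = follows_freeze

    def on_click(self, advance):
        # Absorb residual rapid-fire clicks for 1 s after a freeze: beta-testers
        # stopped clicking within ~600 ms, and puzzles take over 1 s to solve,
        # so the extra delay stays invisible to the user.
        if self.follows_freeze and time.monotonic() - self.opened_at < 1.0:
            return
        advance()
```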
In the logfile, the same timing scheme shown to the user was recorded in minutes and seconds, as was the computer's own clock time at the start of the experiment. To further refine this measurement scheme, the logfile also recorded the current number of the system's 'ticks' at each mouse-click and at other strategic points in the experiment. Ticks are Director's fine-grained time-measurement scheme, occurring every 8 ms and counted from the moment Director was most recently started. Together, these measures provided the high degree of timing accuracy needed to synchronise time-sensitive physiological data with real-world stimuli.

The sole input device with which subjects interacted was a standard Macintosh mouse, modified to include a second cable that plugged into the physiological sensing system (described earlier) and yielded a pulse on each mouse-click. Every click of the modified mouse was thus recorded both as a timed event in the logfile and as a pulse in the sensing system. By modifying the mouse hardware to 'talk directly to' the physiological sensing system, behavioural mouse-clicks and physiological responses were accurately synchronised. Since the logfile generated by the game application also recorded contextual information about mouse-clicks (correct/incorrect game answer, puzzle number, and the occurrence and status of the system delays), this altered mouse yielded a click record that served as critical, high-precision synchronisation data between stimulus and user response (Fig. 2).

Recommendation: Multiple data inputs must be very precisely synchronised. This may require creating overlapping events recorded on multiple systems to facilitate their alignment, which in turn may call for customised means such as novel hardware modifications.

5 Conclusions and future directions

This paper has described an experimental methodology for eliciting events likely to lead to user frustration, and for successfully gathering and synchronising precise physiological, behavioural, visual and operational data in pursuit of automated recognition of user affect. Four general methodological principles were proposed and illustrated with a specific experimental design. This design was successfully used to gather accurately synchronised data from 24 subjects. We analysed the physiological and behavioural data gathered, proposing new features to extract from the physiological portion of the data, and developing an automatic technique for classifying the features using hidden Markov models. The resulting classification was significantly better than random for 21 out of 24 subjects, suggesting that there is some important discriminating information in the two physiological signals of GSR and BVP, although this discrimination is far from perfect. We also found four classes of mouse-clicking patterns exhibited by users when the system did not advance to the next screen on the first click. Both the physiological and mouse-clicking patterns point to user-dependent responses, but patterns that a machine could nonetheless begin to model, and potentially learn to recognise.

In future experiments and applications, the specific signals collected and the features analysed can be expected to vary with the goals at hand. Here, we used five features taken from GSR and BVP, while another implementation might use heart-rate variability taken from an electrocardiogram and muscle tension taken from an electromyogram.
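As a rough illustration of the two-regime classification step mentioned above, and not the actual analysis (which is detailed in Fernandez (1997)), one can train one hidden Markov model per regime and label a segment by comparing likelihoods. The sketch below uses the third-party hmmlearn library and synthetic stand-in features; the real models, features and training procedure differ in detail.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM   # third-party: pip install hmmlearn

# Illustrative sketch: fit one HMM on feature sequences from likely
# frustration regimes and one on calm regimes, then label an unseen
# segment by whichever model assigns it higher log-likelihood.
# Synthetic Gaussian data stands in for the real GSR/BVP features.

rng = np.random.default_rng(0)
calm = rng.normal(0.0, 1.0, size=(200, 2))         # e.g. [GSR, BVP] features
frustrated = rng.normal(1.5, 1.0, size=(200, 2))

hmm_calm = GaussianHMM(n_components=3, covariance_type="diag").fit(calm)
hmm_frus = GaussianHMM(n_components=3, covariance_type="diag").fit(frustrated)

segment = rng.normal(1.5, 1.0, size=(40, 2))       # unseen test segment
label = ("frustration" if hmm_frus.score(segment) > hmm_calm.score(segment)
         else "calm")
print(label)
```

Comparing per-model log-likelihoods in this way is the standard maximum-likelihood decision rule for HMM classification (Rabiner and Juang, 1986).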
We may in fact discover at some point that other sensors are better suited to this experiment than the ones we actually used. The key guiding principles presented in this work, however, are invariant to the specific physiological signals measured. Means of precise synchronisation, and of linking to external and behavioural context, are the key contributions of the methodology presented here for gathering and making use of physiological information.

Even in an ideal affective computing system, we envision that user responses will not always be unambiguous, and that in some cases the recognition system may need to prompt the user for subjective input. This prompting will itself need to be conducted sensitively, in a way that does not increase the user's frustration. Over time, we expect that a system could 'get to know' an individual's patterns of frustration (and other emotion-related responses), and correlate these with the system behaviours that might be responsible. Although the system would not necessarily be able to deduce causation, with a little more input from the user such deductions might become possible. A proactive system might occasionally ask the user something like: "Would you prefer that this system's behaviour X go away?" Information about which system functions are most correlated with episodes of user frustration could be extremely valuable to human-computer interaction designers, providing them with an 'ongoing human factors' analysis, not just before a product is released, but continuously, while interaction proceeds in situ.

Until the correct combination of physiological and behavioural signals for recognising a state such as user frustration becomes apparent, there should be continued focus on specific pattern recognition techniques. A logical next step would be to repeat this experiment, using the same methodology but varying the situations to induce a broader range of emotional responses. For example, we could run the same game but inject likely pleasure-eliciting stimuli instead of frustration-eliciting ones, such as the computer adding extra points to the user's score, or presenting the user with sincere-sounding praise for something the user did (Morkes et al., 1998; Reeves and Nass, 1996). The system described in this paper is designed to be extensible to such future inquiries.

Although we collected up to three different data sets from each subject, a second goal is to take a more detailed look at individual responses in a longitudinal design, gathering a larger number of signals from single subjects over a series of repeated observations, especially over many days. In related work on recognising emotional expression in physiology, it has been observed that there can be more difference in how the same emotional response is expressed on different days than in how different emotional responses are expressed on the same day (Vyzas and Picard, 1998).

Ideally, an affect-intelligent computer should be able to use the information it gains from the user to enhance the computer-human interaction. If a system recognises that the user is experiencing distress, it might act to ameliorate that stress, or simply monitor it and make an internal note associating one of the system's behaviours with a probability of frustration. In a companion paper in this special issue, Klein et al.
describe alternate responses that a computer agent might use to try to help a user reduce frustration arising in a human-computer interaction (Klein et al., 2002). Ethical, philosophical and other considerations of such responses are discussed at length by Picard and Klein, also in this issue (Picard and Klein, 2002). Whatever the strategy, the system will probably work best once it learns the individual preferences of its user, possibly including characteristics of the user's personality.

Eventually, we hope to address complex affective data sets collected from natural situations occurring outside the laboratory. This may be done by porting the sensing and recognition systems presented here to wearable computers, equipped not just with sensors to detect the user's emotional expression, but also with means to discern information about the user's situation.

In sum, we suggest that the methodology presented here has many applications beyond the specific experiment described in this paper. It addresses key design issues involved in the simultaneous monitoring of several input devices, while also providing data for subsequent pattern analysis, all within the context of trying to learn more about characterising a user's affective response. Our broader goal echoes Winograd et al.'s (1996) view that we must perform experiments that pay close attention to the entire 'user experience'. We have emphasised that a critical part of this experience involves emotion, and that an affective computer could address this important component of the user experience by trying to recognise and respond appropriately to the user's emotion. Although there is still much to be investigated, including real-time accurate recognition of user signals, improved sensor selection, exploratory analyses of more behavioural variables, and improved machine awareness of situations, we believe that the approach presented here offers a significant first step toward the development of computers that not only pay close attention to the user experience, but begin to recognise and respond to the affective qualities that people naturally bring to a human-computer interaction.

References

Amsel, A., 1992. Frustration Theory. Cambridge University Press, Cambridge.
Ark, W., Dryer, C., Lu, D.J., 1999. The emotion mouse. In: Bullinger, H., Ziegler, J. (Eds.), Human-Computer Interaction: Ergonomics and User Interfaces. Lawrence Erlbaum, New Jersey.
Cacioppo, J., Tassinary, L., 1990. Inferring psychological significance from physiological signals. American Psychologist 45 (1), 16-28.
Dawson, M., Schell, A., Filion, D., 1990. The electrodermal system. In: Cacioppo, J., Tassinary, L. (Eds.), Principles of Psychophysiology: Physical, Social and Inferential Elements. Cambridge University Press, Cambridge.
Ekman, P., Levenson, R.W., Friesen, W.V., 1983. Autonomic nervous system activity distinguishes among emotions. Science 221, 1208-1209.
Fernandez, R., 1997. Stochastic modeling of physiological signals with hidden Markov models: a step toward frustration detection in human-computer interfaces. MS Thesis, MIT Media Laboratory.
Helander, M., 1978. Applicability of drivers' electrodermal response to the design of the traffic environment. Journal of Applied Psychology 63 (4), 481-488.
Henning, R.A., Callaghan, E.A., Guttman, J.I., Braun, H.A., 1995. Evaluation of two self-managed rest break systems for VDT users. In: Symposium Proceedings of the Human Factors and Ergonomics Society 39th Annual Meeting, 2, pp. 780-784.
Kiesler, S., Zubrow, D., Moses, A.M., Geller, V., 1985. Affect in computer-mediated communication: an experiment in synchronous terminal-to-terminal discussion. Human-Computer Interaction 1 (1), 77-104.
Klein, J., Moon, Y., Picard, R.W., 2002. This computer responds to user frustration: theory, design, and results. Interacting with Computers 14 (5).
Kramer, A.F., 1991. Physiological metrics of mental workload: a review of recent progress. In: Damos, D.L. (Ed.), Multiple-Task Performance. Taylor & Francis, London, pp. 329-360.
Lang, P.J., Greenwald, M.K., Bradley, M.M., Hamm, A.O., 1993. Looking at pictures: affective, facial, visceral, and behavioral reactions. Psychophysiology 30, 261-273.
Lawson, R., 1965. Frustration: The Development of a Scientific Concept. MacMillan, New York.
Marrin, T., Picard, R.W., 1998. The conductor's jacket: a device for recording expressive musical gestures. In: Proceedings of the International Computer Music Conference, December 1998.
Mayhew, D.J., 1992. Principles and Guidelines in Software User Interface Design. Prentice Hall, Englewood Cliffs, NJ.
Morkes, J., Kernal, H., Nass, C., 1998. Effects of humor in computer-mediated communication and human-computer interaction. In: Proceedings of the Conference on Human Factors in Computing Systems (CHI 98 Summary), Los Angeles.
Papillo, J., Shapiro, D., 1990. The cardiovascular system. In: Cacioppo, J., Tassinary, L. (Eds.), Principles of Psychophysiology: Physical, Social and Inferential Elements. Cambridge University Press, Cambridge.
Picard, R.W., 1997. Affective Computing. MIT Press, Cambridge, MA.
Picard, R.W., Healey, J., 1997. Affective wearables. Personal Technologies 1 (4), 231-240.
Picard, R.W., Klein, J., 2002. Computers that recognise and respond to user emotion: theoretical and practical implications. Interacting with Computers 14 (5).
Rabiner, L.R., Juang, B.H., 1986. An introduction to hidden Markov models. IEEE ASSP Magazine, January, 4-16.
Reeves, B., Nass, C., 1996. The Media Equation: How People Treat Computers, Television, and New Media Like Real People and Places. Cambridge University Press, New York.
Rowe, D.W., Sibert, J., Irwin, D., 1998. Heart rate variability: indicator of user state as an aid to human-computer interaction. In: Proceedings of ACM CHI 98 Conference on Human Factors in Computing Systems, 1, pp. 480-487.
Scheirer, J., Fernandez, R., Picard, R.W., 1999. Expression glasses: a wearable device for facial expression recognition. In: Proceedings of CHI'99, Pittsburgh, PA.
Shneiderman, B., 1986. Designing the User Interface: Strategies for Effective Human-Computer Interaction. Addison-Wesley, Reading, MA.
Schlosberg, H., 1954. Three dimensions of emotion. Psychological Review 61, 81-88.
Vyzas, E., Picard, R.W., 1998. Affective pattern classification. In: AAAI Fall Symposium Series: Emotional and Intelligent: The Tangled Knot of Cognition, Orlando, FL, 23-25 October.
Wastell, D., 1990. Mental effort and task performance: towards a psychophysiology of human-computer interaction. In: Proceedings of IFIP INTERACT'90: Human-Computer Interaction, pp. 107-112.
Wiethoff, M., Arnold, A.G., Houwing, E.M., 1991. The value of psychophysiological measures in human-computer interaction. In: Proceedings of the Fourth International Conference on Human-Computer Interaction, 1, pp. 661-665.
Wilson, G.F., Eggemeier, F.T., 1991. Psychophysiological assessment of workload in multitask environments. In: Damos, D.L. (Ed.), Multiple-Task Performance. Taylor & Francis, London, pp. 329-360.
Winograd, T., Bennet, J., De Young, L., Hartfield, B., 1996. Bringing Design to Software. Addison-Wesley, Reading, MA.

© 2002 Elsevier Science B.V. All rights reserved.

TI - Frustrating the user on purpose: a step toward building an affective computer
JF - Interacting with Computers
DO - 10.1016/S0953-5438(01)00059-5
DA - 2002-02-01
UR - https://www.deepdyve.com/lp/oxford-university-press/frustrating-the-user-on-purpose-a-step-toward-building-an-affective-6MDVFIRUZ6
SP - 93
EP - 118
VL - 14
IS - 2
DP - DeepDyve
ER -