Yuhui Wang, Chunfu Li, Tian Lei and Qin Huang Abstract This paper reports a longitudinal study on the retrospective evaluation of perceived usability. We propose a retest method based on SUS questionnaires to reduce the multiple-observation effects typically present in longitudinal research. We designed different forms of the SUS questionnaire to eliminate memory effects and distributed duplicate SUS copies to subjects. We first measured the alternate-form reliability of the duplicate in Experiment I (r = 0.814). We then found that the duplicates we designed significantly reduce memory effects compared to the standard SUS questionnaire. Experiment II involved tests on the retrospective evaluation of perceived usability at five measurement points after initial use. The retrospective evaluation of perceived usability changes over time. After 5 min (at point T2), we found that each subject's retrospective evaluation changed significantly regardless of whether the task had a higher or lower usability level. Moreover, the direction of this change tended to differ. This indicates that time is an important factor in perceived usability evaluation. Our results altogether indicate that time affects the retrospective evaluation of perceived usability as well as the accuracy of perceived usability evaluation, which suggests that the process of perceived usability assessment requires strict time constraints. RESEARCH HIGHLIGHTS 1) Based on the SUS standardized questionnaire, this paper presents a retest method for the longitudinal assessment of perceived usability. 2) We show that this method has high alternate-form reliability and eliminates most of the memory effects of retests administered within a short time period. 3) Experiment II shows that the user's retrospective assessment of perceived usability changes over time, especially at a fairly short time granularity (5 min). 4) The direction of change in retrospective evaluation differs at different levels of usability. 5) The test process requires strict time control to ensure accurate perceived usability information. 1. Introduction Traditional usability is defined as 'the effectiveness, efficiency and satisfaction with which specified users can achieve specified goals in particular environments' (ISO 9241-11-1998). The usability of a product from the user's perspective is its 'perceived usability'. As opposed to 'inherent usability' or 'performance-based usability', perceived usability represents the user's subjective feelings (Yang et al., 2012). Among the existing methods for perceived usability testing, the most commonly used is the standardized post-study questionnaire (Sauro and Lewis, 2012). Many studies on human–computer interaction and usability engineering have centered on the measurement of perceived usability via post-study standardized questionnaires such as SUS (Brooke, 1996), UMUX (Finstad, 2010b), UMUX-LITE (Lewis et al., 2013), SUMI (Kirakowski and Corbett, 1993), USE (Lund, 2001) and QUIS (Chin et al., 1988). Post-study questionnaires for usability testing are distributed to subjects to determine their responses to using the system. The experimenter provides the test system to the subjects, describes all relevant settings and then distributes a standardized questionnaire, which the subjects fill out and return to the experimenter after the task is completed.
The experimenter calculates the perceived usability scores and determines whether the subjects would recommend the system to others according to their evaluations. The subjects are usually not familiar with the product prior to such an experiment; that is, the results of usability tests provide information about the first stage of user–product interaction, but not about how user–product interaction changes over time (Sonderegger et al., 2012). This information is important, however, as individuals' recollections of past experiences, and their reports of those experiences to others, guide their future behavior (Norman, 2009). There are two main approaches through which users come to recommend products to others. The first involves user experience (UX) based on long-term use. Long-term positive experiences boost user loyalty and have a positive impact on each stage of product use (Fenko et al., 2009). The second is one-off tests, or 'snapshots' of usability (Mendoza and Novick, 2005). Snapshots are formed after using a product only once. They can also be defined as the 'orientation phase' of product use (Karapanos et al., 2009). The user may describe this experience to others (Kujala et al., 2011) and make recommendations by recalling the usability of the product (Kujala et al., 2013). In both approaches, time plays an important role: the first is a longitudinal experience that unfolds as time goes on, and the second is a retrospective evaluation made after a certain period of time. Many previous researchers have explored, from a longitudinal perspective, how the perception, evaluation and use of technology develop over time (Karapanos et al., 2012). Perceived usability is considered to be a relatively high-level usability construct (Lewis et al., 2015; Lewis, 2018; Hornbæk, 2006) characterized by user-oriented evaluation. Many usability evaluations center on perceived usability, and the user's evaluation of usability changes over time. For example, a longitudinal study conducted by Sonderegger et al. (2012) revealed that the influence of esthetics on perceived usability changes with time. Kjeldskov et al. (2010) evaluated the usability of a health care information system in a longitudinal study. Most previous studies have shown that the user's evaluation changes over long periods of time (Kjeldskov et al., 2005; Karapanos, 2008; Kujala et al., 2013; Harbich and Hassenzahl, 2017). These changes are typically identified as shifts over time in evaluation indexes (e.g. esthetics, experience, usability) and in the relationships among them. There is a long history of research on the gap between emotional recall and experience in the field of psychology (McFarland and Ross, 1987; Thomas and Diener, 1990; Robinson and Clore, 2002). The research on this topic is focused on emotional memories. The retrospective evaluation of emotion is believed to markedly differ from an evaluation given at the present moment. The evaluation may grow more favorable or less favorable with time, and the intensity of such change may also differ over time (Walker et al., 1997; Talarico et al., 2004; Schäfer et al., 2014). Retrospective studies on HCI have been conducted by many researchers. For example, Kujala and Miron-Shatz (2013) proposed that both emotions and individuals' distinct recall of emotions play strong, unique roles in the overall evaluation of a product. Bruun and Ahm (2015) also found that there are significant differences between retrospective and concurrent experiences.
Psychological studies have shown that compared to a concurrent evaluation, the user's retrospective evaluation may markedly change (Miron-Shatz et al., 2009). However, in longitudinal and retrospective research on HCI, researchers rarely explore such changes in perceived usability and tend to use relatively large measurement time granularities. Further, some studies have shown that emotional change can be very fast, occurring within even a few minutes or seconds (Schäfer et al., 2014). In HCI, there has been little research on such short time granularities, whether in longitudinal or retrospective evaluations. Moreover, multiple-observation effects (Bordens and Abbott, 2007) may render the experimental methods of longitudinal and retrospective studies altogether unsuitable for short time granularities. The purpose of the present study was twofold. We report our observations on whether retrospective assessment shows longitudinal changes at different usability levels, especially within a short time granularity. We also report our attempts to reduce multiple-observation effects over a short period of time, which mostly involved redesigning perceived usability retests. We reviewed the extant literature on longitudinal research and retrospective evaluation in HCI, as well as the retest methods used by previous researchers. We designed a new alternate-form SUS questionnaire, then designed tests to assess its reliability and to verify whether it eliminates memory effects. We ran a longitudinal experiment using our method to evaluate the retrospective perceived usability of two tasks over several repetitions, including a short-time-granularity retest (5 min). We made several observations and defined several future research directions accordingly. 2. Related work 2.1. Developmental study of perceived usability Usability testing is one of the most important and most widely used methods for evaluating the quality of a product's design (Lewis, 2006). Such tests may be targeted toward perceived usability, performance-based usability or inherent usability. Perceived usability intuitively reflects user attitudes and judgments, which researchers and developers consider to be very valuable (Sauro and Lewis, 2012). Perceived usability is affected by a variety of factors over the evaluation process. Design esthetics (Sonderegger and Sauer, 2010; Hassenzahl and Monk, 2010), state of the user (Sauer and Sonderegger, 2011), user expertise and prototype fidelity (Sauer et al., 2010), age (Sonderegger et al., 2016) and other factors are commonly studied influences. These studies provide a workable reference for improving evaluation accuracy and reveal several factors that constrain the perceived usability evaluation environment. Temporal factors are also important in perceived usability. The impact of time on perceived usability can be broadly classified into two groups: evaluation accompanying the use process (longitudinal studies) and retrospective assessment. Longitudinal studies are necessary for future usability testing (Sonderegger et al., 2012). In these studies, the amount of time the system is used affects the evaluation of perceived usability (Sonderegger et al., 2012) and also affects the user's experience.
Usability is considered an essential element of UX (Albert and Tullis, 2016), but the definition of UX continues to expand (Hassenzahl, 2006; ISO 9241-210), and there is a significant relationship between usability and UX (Vermeeren et al., 2010; Mahlke and Thüring, 2007; Park et al., 2013; Lallemand et al., 2015). Users may temporarily tolerate lower usability in a product due to esthetic or other characteristics, but usability becomes more important with increased use time, especially for targeted or long-term tasks (Kujala and Miron-Shatz, 2013). Due to the overlap of usability and UX, many longitudinal UX studies have measured changes in perceived usability over time. Sonderegger et al. (2012) found that the positive effects of esthetics on perceived usability and emotions wane over time (2 weeks). Kujala et al. (2011) evaluated the perceived attractiveness, ease of use, utility and degree of usage of a product and found that attractiveness decreases while the other factors improve over time. Karapanos et al. (2009) also found that perceived value and relative dominance change over time: the early usage of products tends to reflect hedonic desires, but the standards for evaluation change as time passes (4 weeks). In a longitudinal study on social network services, Kim et al. (2015) found that usability, affect and user value change over time (1 week). Sonderegger et al. (2012) likewise found that the positive effects of products in perceived usability assessments decrease as exposure time increases (2 weeks). All of these studies observed changes in perceived usability over time, which is reasonable given that UX is a long-term process. Time can also alter study results, because researchers have observed significant changes in retrospective assessments of perceived usability (Kujala and Miron-Shatz, 2013; McDonald et al., 2013; Haak et al., 2003). Continuous measurements may differ between retrospective studies and longitudinal studies. Many early investigators confirmed that memory lacks reliability (Miron-Shatz et al., 2009), which is what separates longitudinal studies from retrospective studies. Researchers and practitioners in the HCI field use retrospective evaluation as a direct indicator of product quality. The implicit assumption is that retrospective evaluation is the mean or sum of all the moments encountered over the course of an experience (Hassenzahl and Ullrich, 2007). However, there are memory–experience gaps in retrospective studies (Bruun and Ahm, 2015). Deviations in emotional attenuation and recollection can cause changes in retrospective assessments as time goes on. Memories of a certain feeling fade as time goes by (Kaplan et al., 2016). The gap between retrospective evaluation and concurrent experience exists in a variety of cases, such as ratings of pain (Redelmeier and Kahneman, 1996), memories of pleasant and unpleasant feelings (Miron-Shatz et al., 2009) and vacation experiences (Mitchell et al., 1997). Perceived evaluations are formed in the same way as emotions are; both are geared toward a particular event or object (Kaplan et al., 2016). Therefore, the user's retrospective evaluation of perceived usability differs from the initial, concurrent evaluation. There is a 'gap' between the two evaluations. In HCI, this gap has been thoroughly validated. Individuals tend to overestimate their own evaluations (Miron-Shatz et al., 2009). Seo et al.
(2015) investigated the relationship between perceived usability and perceptual esthetics as well as the relationship between emotions and reactions, and identified a positive correlation between perceived usability and emotion. Hassenzahl and Sandweg (2004) used a mental-effort evaluation experiment to show that the retrospective evaluation of perceived usability is not a reflection of the whole period of usage; it instead reflects the most recent usage events, and evaluations can change within 2 h. Kujala and Miron-Shatz (2013) studied the changes in emotion and experience during the daily use of mobile phones over a 5-month period; they found that users overestimated their positive emotions and the importance of usability over time in the early stages of use. Bruun and Ahm (2015) also found a significant discrepancy between retrospective and concurrent ratings of emotion. The retrospective evaluation did not accurately reflect the emotional state of the whole observation period, and retrospective ratings of negative experiences were overestimated. The gap can be explained by the peak–end rule (Bruun and Ahm, 2015), as well as by recency effects (Hassenzahl and Sandweg, 2004). Other researchers hold opposing views, suggesting that retrospective evaluation may also have advantages. Karapanos et al. (2012), for example, argue that retrospective judgments are more realistic for users than judgments given 'in the moment'. Studies by Kortum and Bangor (2013) and Kortum and Sorber (2015) have shown that retrospective usability measures are an effective approach because the 'face validity of the assessment is improved' and the tests are lower in cost. Retrospective reviews are important because people recommend products to others based on retrospective assessment. However, retrospective evaluation may change with time, and this instability is related to the time of evaluation. According to the interval between the measured time points, Moellendorff et al. (2006) divide the 'longitudinal study' concept into three granularities: micro-perspective (e.g. 1 h), meso-perspective (e.g. 5 weeks) and macro-perspective (the whole product lifecycle). Changes in perceived usability over time have been observed in longitudinal studies ranging from 15 min (Minge, 2008) to several months (Kujala and Miron-Shatz, 2013; Karapanos et al., 2010). Although the time granularity coverage of such studies tends to be large, it is not very comprehensive. It is particularly worth pointing out that the time granularity of retrospective studies is concentrated in the meso-perspective; research restricted to the meso-perspective is clearly incomplete. Emotion plays an important role in HCI as it affects the actual interaction between the user and the product (Forlizzi and Battarbee, 2004). Further, the rate of attenuation may differ for different emotions (Berntsen, 1996; Walker et al., 1997; Talarico et al., 2004). Different types of emotions also exert different effects on usability (Kujala et al., 2011): positive emotions are usually associated with higher UX, while negative emotions are generally related to low usability (Kujala and Miron-Shatz, 2013). More importantly, some studies (Nagel et al., 2007; Schäfer et al., 2014; Caetano et al., 2012) have unexpectedly revealed that the attenuation of emotion is not long-lasting; rather, the arousal of and changes in emotion occur relatively quickly, within a few minutes or even seconds.
Emotional changes may happen very early in the course of product usage, and emotion does not decay over time in a linear fashion, but rather along a negative exponential curve (Qiu, 2001). Further, the attenuation speed and strength differ for positive and negative emotions. So does the retrospective usability assessment provided by a subject change over a very short time granularity (e.g. a few minutes)? Are the changes in perceived usability consistent in direction and speed at the micro-perspective granularity? Is the perceived usability evaluation linear with time? Again, evaluations of product usage can change very quickly. Evaluation methods must be selected accordingly. The results of long-term evaluations may not accurately reflect the perceived usability over a shorter amount of time. Perceived usability tests may take much longer than the time it takes for an emotion to change. Long interviews or lengthy questionnaires may affect users' opinions (Yang et al., 2012). Users may be biased due to the attenuation of their emotions, and lengthy questionnaires tend to produce a fatigue effect (La Bruna and Rathod, 2005). The same question may thus be answered differently at the beginning versus the end of a given evaluation period. If the evaluation can change within a short time, a strict time limit may be required to ensure accurate perceived usability evaluations. Considering usability as a general concept, various types of questionnaires have been developed to evaluate all types of electronic products under discrete time limits (Yang et al., 2012). To do this, strict time limits must be set on the distribution time of the questionnaire, the preparation before the test and the assessment of any extra variables related to the users' familiarity with the test process, as all of these factors can cause deviations in test results. However, almost no researchers who have used post-study tests have extensively assessed these effects. Setting a strict beginning and end time for a task can improve evaluation accuracy. This is particularly important for a large-sample test, because time exerts a more significant impact across a greater number of subjects. Additionally, if a test result is based on a one-off post-study test, we can make a preliminary judgment of the direction and size of the error according to the usability scores of the task and the test time. Beyond that, one-off testing is a common practice but one generally considered ineffective (Sonderegger et al., 2012). When the user recalls the software being tested, or uses the product only a very small number of times, he or she does not gain meaningful experience. Early-stage uses may inform a decision to continue using the software, but not a true assessment of the software's usability. For example, a longitudinal study may measure the actual usage of a piece of software (Chittaro and Vianello, 2016) but produce low overall usage reports without capturing the precise reason that many users abandoned the study. This suggests that longitudinal studies of perceived usability are essential to detect changes in user evaluations during long-term use. It is precisely because of such an evaluation that a user abandons unfamiliar software. However, previous researchers have generally neglected retrospective evaluation after one-off testing, both at various measurement granularities and in both types of measurement (longitudinal and retrospective).
2.2. Extant problems in perceived usability retesting methods Generally speaking, a standard perceived usability testing process can be considered a single evaluation or one-off test. For any product, the first use is very important (Thielsch et al., 2018). The user typically has not used the system before the test, so the test reflects the first impression of the product. It is also very common for users to make a retrospective review of a product after a single trial. Hassenzahl and Ullrich (2007) showed that users evaluate products based on memory, which guides them toward certain decisions about products that they continue to use and recommend to others. Therefore, the retrospective evaluation of perceived usability is very important. The results at different times may differ, even within a very short time (micro-perspective granularity). Developmental studies are needed in certain cases, where at least two measurements are used to assess perceived usability. Developmental studies also encompass a large number of retest designs. There are roughly three categories: pretest–posttest designs, cross-sectional designs and longitudinal designs. Pretest–posttest designs take measurements before and after a treatment; comparing the results reveals the influence of time throughout the process. Kjeldskov et al. (2010), for example, studied the changes in usability experience as a novice grows into an expert; Karapanos (2008) similarly observed changes in evaluative judgments after the use of an interactive TV set-top box via a pretest–posttest design. Although this method can reveal change effects in some aspects, it is difficult to control the key assumptions. The cross-sectional design generally uses novice and expert groups to implement classification in HCI-related longitudinal studies. For example, McLellan et al. (2012) used a cross-sectional design to assess the differences in perceived usability between novices and experts. Generation effects related to the assumed differences between novices and experts affect the validity of these studies, however (Bordens and Abbott, 2008). Although cross-sectional designs are quick and cost-effective (Karapanos et al., 2010), definitions of 'novice' and 'expert' are difficult to establish (Prümper et al., 1992). In the longitudinal design, conversely, a single group of subjects is tracked for a period of time, which clearly reveals developmental changes in UX or perceived usability (Karapanos et al., 2009; Minge, 2008; Kim et al., 2015). The longitudinal design does not produce the generation effect found in cross-sectional studies, as changes before and after can be effectively measured. However, this design requires a great deal of time and effort to collect data, some data may be lost, and multiple-observation effects are easily produced (Bordens and Abbott, 2008). In addition to cross-sectional design studies, other researchers have used within-subject designs leading to retest or 'multiple-observation' effects (Calamia et al., 2012; Bordens and Abbott, 2008), which seriously threaten internal validity. Retests lead to changes in measurement scores that may be more obvious when the same test form is reused (Arendasy and Sommer, 2013); these effects are large and widespread but are usually underestimated (Goldberg et al., 2015).
The retest effect has a variety of sources and can be explained by problem-solving strategies, fatigue effects, memory, practice and familiarity, age, motivation or other phenomena (Bordens and Abbott, 2008; Jones, 2015; Register-Mihalik et al., 2012; Salthouse and Tucker-Drob, 2008; Randall and Villado, 2017). These factors can be roughly divided into three categories: refinement of solution strategies, familiarity and memory of items. Problem-solving strategies involve the use of various solutions of varying effectiveness and efficiency to solve test problems. Under the familiarity hypothesis, subjects inherently differ in the extent to which they are familiar with standardized testing and the mechanics of test processes (Arendasy and Sommer, 2017). Memory of items is a particularly important cause of differences between subjects, as memory of the results of a first test item affects subsequent measurements of that item (Salthouse and Tucker-Drob, 2008; Miller et al., 2009; Beglinger et al., 2005). Subjects using the same test form in the retest process typically present memory effects (Lievens et al., 2007). Retest effects are more obvious within shorter intervals. The use of alternate forms can effectively minimize such effects even within short intervals (Calamia et al., 2012; Pereira et al., 2015). Calamia et al. (2012) and Bordens and Abbott (2008) found that the use of alternate forms that eliminate item memory can minimize retest effects. Alternate forms limit the memorization of items and reduce the application of decision-making strategies, preventing subjects from carrying the items in memory into the retest (Arendasy and Sommer, 2017). Although a question bank can also be used for computer adaptive testing, this method is not suitable for paper questionnaires. In actual measurements, relatively few researchers have successfully reduced retest effects (Moellendorff et al., 2006) through counterbalancing (item randomization or cyclic presentation), even in studies with micro-perspective granularity (Minge, 2008). There is a wide array of methods available for measuring perceived usability, including questionnaires, interviews and 'think-aloud' exercises. The most common perceived usability measurement method is the questionnaire (Lewis et al., 2015). Over 20 standardized questionnaires have been published to date (Assila et al., 2016). Among all existing questionnaires, SUS has been used particularly extensively (Lewis, 2018; Sauro and Lewis, 2009; Brooke, 2013). SUS includes only 10 items of alternating tone, each answered on a 5-point Likert scale, so it can be used to quickly measure perceived usability. The use of the SUS questionnaire in a short-interval retest is susceptible to retest effects arising from both item memory (as there are only 10 items) and the length of the time interval. We analyzed the sources of SUS's retest effects. The familiarity hypothesis is not applicable to SUS, because the Likert scale is very simple and easy to understand; subjects can fill it in quickly and clearly report their opinions. Retest familiarity is easily ruled out if subjects have filled out Likert scales prior to the study. SUS is fast, and there are only 10 items with relatively few dimensions to test (Lewis, 2018). There are few efficiency strategies available to users filling out the SUS, as there are already relatively few questions, and its items are easily remembered (Lievens et al., 2007).
However, because of the simple structure of SUS, the retest effect comes from remembering the answers given in previous tests. In other words, the retest results do not fully reflect changes in perceived usability over time but instead reflect the memory of the most recent testing results. To reduce retest effects, alternate forms of SUS can be used interchangeably, but this means that the measurement invariance of SUS needs to be maintained, including invariance of structure (configural invariance), metric invariance (that is, unchanged item discrimination parameters) and scalar invariance (unchanged item difficulty) (Widaman et al., 2010; Arendasy and Sommer, 2017). Unfortunately, the only published alternate version of SUS is the all-positive version (Sauro and Lewis, 2011), and half of its items are identical to those of the original questionnaire. Alternate SUS versions therefore need to be developed before changes in perceived usability retest results can be measured, and they must comply with measurement invariance. 2.3. Research methodology In addition to one-off testing, perceived usability evaluation can be conducted via longitudinal studies, retrospective studies, cross-sectional studies and other designs. Previous studies show that retrospective evaluations of usability inevitably change. However, the vast majority of previous longitudinal and retrospective studies focused on meso-perspective granularity (1 h to several weeks). Other studies have shown that emotions can change within a very short time frame and that the process of generating emotions is similar to the development of perceived usability. Consequently, the evaluation scores of perceived usability may change within a very short time. There are a few reasons why it is necessary to examine changes in retrospective assessments of perceived usability within a short period of time: (i) to improve the accuracy of perceived usability assessment; (ii) to improve the SUS questionnaire evaluation process; and (iii) to find the direction and rule of user evaluation changes as a basis for setting appropriate measurement times. The SUS has consistently shown high psychometric validity in perceived usability measurement, whether across various types of systems or in comparison against other questionnaires (Albert and Tullis, 2016; Lewis et al., 2015). It also correlates significantly with other questionnaires (Borsci et al., 2015; Berkman and Karahoca, 2016). Some researchers also believe that SUS can be used in retrospective studies (Gao and Kortum, 2017). In order to observe changes in perceived usability within a short time and to secure meaningful reference points, we also used the SUS questionnaire for the retest. We chose the SUS (with wording modified following Finstad, 2006) as the retest instrument for the present longitudinal study. Certain measures must be taken to minimize the retest effect before perceived usability retesting. It is necessary to adopt an alternate-form SUS that meets the design principles for retest duplicates (Widaman et al., 2010; Arendasy and Sommer, 2017; Benedict and Zgaljardic, 1998). In our retest, we measured the same task with alternate forms of the SUS. We modified the SUS questionnaire for retesting according to several rules, as described below. The following are several important aspects of our work: (i) The SUS questionnaire is commonly used and accepted as effective, so we used it to complete our test after making appropriate modifications.
Because the questionnaire is short and used repeatedly over a micro-perspective granularity, we redesigned duplicate copies (alternate forms) of the standard SUS questionnaire to eliminate retest effects. (ii) We designed two tests to confirm that the duplicate copy (retest form) we designed eliminates item memory effects. The alternate-form reliability test demonstrated consistency between the duplicate and the original questionnaire. (iii) We designed Experiment II to observe the longitudinal changes in retrospective assessments of perceived usability at different usability levels, with the alternate-form SUS as the measuring tool. We selected five measurement points covering both the micro-perspective granularity (5 min) and a longer granularity (1 week). (iv) We observed the longitudinal changes in each item's score on the SUS questionnaire. Table 1. Standardized SUS scale and duplicate copy (note: English and Chinese versions are compared; in the experiment, the Chinese expressions were used). 3. Production and testing of measuring tools (SUS duplicate) In this study, we referred to the retest modification principles (Widaman et al., 2010) and to modified SUS versions such as the all-positive SUS (Sauro and Lewis, 2011) and word substitution (Finstad, 2006; Sauro, 2011), and drew on the advantages of the 7-point scale (Finstad, 2010a). We modified the SUS questionnaire for retesting according to the following rules: (i) Substitution of synonymous sentences: Substitution methods include synonym substitution and adjustment of sentence expression without altering positive or negative tone. This makes items more difficult to recognize. (ii) Increasing the point-scale grade: The SUS uses a 5-point scale, which we modified to a 7-point scale to reduce memory effects by increasing the number of response options the subjects would have to remember. There is no significant difference between 5- and 7-point scales; 7 points may even be more appropriate (Dawes, 2012; Finstad, 2010a). (iii) Adjusted scoring: The standard version of SUS is calculated on a 5-point scale, so we adjusted the score calculation to account for the 7-point scale. Positive-tone items are scored as $(X-1)\times 1.667$ and negative-tone items as $(7-X)\times 1.667$, so the total score ranges from 0 to 100. (iv) Disordering item tone within pairs: A disordered sequential arrangement can reduce the serial position effect of memory. On the standard SUS questionnaire, odd and even items have opposite tones. To ensure reliability in the copies, we again arranged items in alternating positive and negative tones but with random permutations. The standardized SUS and duplicate copy are shown in Table 1, and the 7-point scale is shown in Table 2. FIGURE 1. Original SUS and duplicate SUS questionnaire score change trends. Table 2. Chinese expression of the 7-point scale. FIGURE 2. Flow chart of Test II (test of item memory effect). The SUS duplicate test consists of two parts.
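Before describing the two parts, a minimal sketch of the 7-point scoring in rule (iii) may help; the function name, input layout and example item ordering below are ours (rule (iv) permutes the tone order in each duplicate), not part of the questionnaire itself.

```python
def sus_7point_score(responses, positive_items):
    """Score a 10-item SUS variant answered on a 7-point scale (1-7).

    responses: 10 integers in 1..7, in questionnaire order.
    positive_items: 0-based indices of the positively worded items.
    Positive items contribute (x - 1) * 1.667 and negative items (7 - x) * 1.667,
    so the total ranges from 0 to roughly 100.
    """
    if len(responses) != 10:
        raise ValueError("expected 10 item responses")
    score = 0.0
    for i, x in enumerate(responses):
        if not 1 <= x <= 7:
            raise ValueError(f"item {i + 1}: response {x} outside 1-7")
        score += (x - 1) * 1.667 if i in positive_items else (7 - x) * 1.667
    return score

# Example: alternating tone as in the standard SUS; the duplicates permute this order.
print(sus_7point_score([6, 2, 5, 3, 6, 2, 7, 1, 6, 2], positive_items={0, 2, 4, 6, 8}))
```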
The first part is an alternate-form reliability test, and the second part verifies whether the alternate-form can eliminate the item memory effect. 3.1. Test I: Alternate-form reliability test 3.1.1. Subjects We selected 90 subjects, aged 19–37 years, from senior undergraduate or graduate students and by online recruiting. Participants were from design or computer-related majors or employed at IT firms. All subjects by online recruiting had undergraduate or graduate degrees. Participants self-reported completion of usability tests (objective measures or subjective evaluations) for different systems. Sixty-two have used standardized questionnaires sometimes. Most subjects were tested by email or online, while a few were tested by paper. The experimental assistant contacted the subjects by telephone in advance, confirmed the test time and recorded the test process. Each subject received a 10-yuan reward at the end of the experiment. 3.1.2. Materials The original SUS and the duplicate form constitute the two test questionnaires through different orders. The SUS-duplicate and duplicate-SUS were distributed as both paper and electronic versions. The test material is the Taobao APP (the most widely used shopping network, accounting for 69% of the online sales share in 2017, China) and there is no restriction on the usage platform. 3.1.3. Experimental design We designed the experiment as a hybrid, randomized two-group pretest-posttest, but two questionnaires were completed at the same time. 3.1.4. Test process Subjects reported having used the Taobao APP for a long time. Retrospective evaluations are more stable after repeated use, so we used retrospective evaluation. Experts first completed an assessment and recorded the assessment time (about 110 s). We asked whether the subjects had filled in a similar scale prior to this study—if not, they were given two practice tests. Then they completed the retrospective test and recorded the completion time and answers. 3.1.5. Results and discussion We next screened our results. Some of the subjects used a retrospective test based on Internet survey tools, where there was no experimenter to observe the evaluation process; the completion time was taken as the basis for the evaluation. If the questionnaires used by the subjects were <25% of the time consumed by the experts, it was believed that the subjects did not read the questions carefully and instead filled them out randomly. Nine questionnaires were abandoned for this reason leaving 81 valid questionnaires. The original results of the experiment and the changing trends are shown in Fig. 1. The Pearson correlation coefficient (alternative-form reliability) is r = 0.814, P < 0.001—the two questionnaires had significant correlation and the retest reliability was good. These results also indicate that the SUS duplicate can be used as an ideal alternative form to complete retest. The SUS scores were, however, affected by item-level memory effect. The subjects completed similar Likert-type questionnaires before, so the familiarity hypothesis could be excluded. The 7-point scale method is also widely used, and there is no report claiming that the subjects used multiple filling-out strategies. The SUS had only 10 questions, and it was easy for the subjects to remember the test answers when the interval between retests was very short. Accordingly, we sought to determine whether the duplicate could prevent item memory and designed a second test. 3.2. 
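Before moving to Test II, the screening rule and reliability computation from Test I can be sketched as follows; the 25% threshold and the roughly 110 s expert time come from Section 3.1.5, while the function name, data layout and use of scipy are our own assumptions rather than the authors' scripts.

```python
import numpy as np
from scipy import stats

def screen_and_correlate(times, expert_time, scores_original, scores_duplicate):
    """Drop questionnaires completed in under 25% of the expert completion time,
    then compute the alternate-form reliability (Pearson r) on the remainder.

    times: completion time in seconds for each subject.
    expert_time: expert completion time (about 110 s in Test I).
    """
    times = np.asarray(times, dtype=float)
    keep = times >= 0.25 * expert_time          # screening rule from Section 3.1.5
    a = np.asarray(scores_original, dtype=float)[keep]
    b = np.asarray(scores_duplicate, dtype=float)[keep]
    r, p = stats.pearsonr(a, b)
    return int(keep.sum()), r, p
```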
3.2. Test II: Test of item memory effect In a retest with real software, subjects are easily influenced by inherent attributes such as esthetics and growing familiarity with the new software, and it is difficult to separate these attributes from the item memory effect. To minimize the impact of such factors, we designed an item answer memory test based on the cued-recall paradigm rather than a software test. 3.2.1. Subjects We chose 64 subjects (two batches), all from design or related majors, including 47 undergraduates and 17 graduate students. The subjects self-reported that they were familiar with the Likert-type scale and had used it to complete subjective usability tests or other tests. A gift of 15 yuan was given to each subject after his or her participation. 3.2.2. Materials We distributed the standardized SUS questionnaire in both the pretest and the posttest to the control group (A). In the experimental group (B), the standardized SUS questionnaire was used in the pretest and the duplicate questionnaire was used in the posttest. We also used the 7-point scale for the standard SUS so that the experimental results would be comparable. 3.2.3. Experimental design We designed the experiment as a hybrid, randomized two-group pretest–posttest. The pretest–posttest was the within-subject factor; the duplicate-copy SUS versus the standard SUS was the between-subject factor. The 64 subjects were pseudo-randomly divided into Group A (control) and Group B (experimental), with the same number of subjects in each group, and each group completed a pretest and a posttest. Group A was tested with the standard SUS (pretest) and the standard SUS (posttest); Group B was tested with the standard SUS (pretest) and the duplicate-copy SUS (posttest). 3.2.4. Procedure The test was completed in the engineering laboratory of the Industrial Design Department, HUST. We first instructed the subjects in filling out the questionnaire and then distributed questionnaires to the two groups. The subjects did not complete any subjective evaluation, and no system was available for reference; they responded to each item of the questionnaire (10 in total) in sequence according to dictated answers. We recorded the instructions in advance ('fill in "3" if you slightly disagree, "5" if you agree', etc.) and allotted sufficient time for the subjects to understand them. The experimental assistant played the instructions, and the subjects then filled out the questionnaires accordingly. To prevent them from reciting the answers, subjects were allowed to listen to the instructions only once. The interval (reading time) between items was 4–6 s. The pretest was completed according to these speech prompts. The recorded answers were selected from two real evaluations of a system: usability evaluations of Adobe After Effects software by two usability experts. The purpose of using real evaluations was to avoid extreme answer options in the instructions, because extreme evaluations lead to more memorable answers. The answer sequences were (4355231463; 3521463525); the two sequences were used alternately. Between the pretest and the posttest, subjects were required to do a 'distraction' task (two intelligence test questions). The flow chart of the procedure is shown in Fig. 2. The subjects were asked to complete the posttest 5 min after completing the pretest. As mentioned above, Group A used the standard SUS questionnaire, and Group B was given the duplicate copy.
We asked the subjects to fill in the posttest questionnaire on the basis of their recall of the pretest (with the posttest questionnaire as a memory cue). The posttest was completed according to prerecorded voice prompts ('Please fill in the answer to the first question ... Please fill in the answer to the second question ...', etc.), which, unlike in the pretest, did not contain the answers. This ensured that the pretest and posttest were filled out within the same amount of time; the interval between the questionnaire items was again 4–6 s. 3.2.5. Results and discussion All subjects completed the test. We did not compare total SUS scores, because a total equal to that of the standard answers can mask item-level differences: some items may score lower than the standard answer and others higher while the totals remain equal. Instead, for each item we computed the deviation between the recalled answer and the standard answer option; its absolute value represents the distance of the deviation, which may be in either direction (positive or negative). The results are shown in Fig. 3. FIGURE 3. Deviation between pretest and posttest answers in Groups A and B. The vertical axis represents the degree of deviation ('0' indicates no deviation; the absolute value of other numbers represents the distance of deviation); the horizontal axis represents the SUS questionnaire item number. FIGURE 4. Task interface: Task (i) using Premiere CS6 and Task (ii) using iJianJi. The frequencies of the answer distance deviations are shown in Table 3.
Table 3. Distance deviation frequency of Group A and Group B ('0' indicates no deviation; the absolute value of other numbers represents the distance of deviation).
Absolute distance deviation   Group A   Group B
0                             98        49
1                             97        87
2                             53        67
3                             17        47
≥4                            5         20
Total                         270       270
Because we used a 7-point scale, the deviation distance provides a more graded measure than simple qualitative (match/no-match) statistics. The closer the answer distance, the more accurate the subject's recall; a greater distance indicates a more 'ambiguous' memory. The results of the ordinal data analysis are shown in Table 4.
Table 4. Results of the ordinal data analysis of Test II (Mann–Whitney test).
Group     Mean rank   Sum of ranks
Group A   311.03      83 977.5
Group B   229.97      62 092.5
Test statistics: Mann–Whitney U = 25 507.500; Wilcoxon W = 62 092.500; Z = −6.267; Asymp. Sig. (two-tailed) = .000
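For reference, a compact sketch of how such a deviation analysis and the group comparison in Table 4 could be computed; the array names, shapes and use of scipy are our assumptions, not a description of the authors' analysis scripts.

```python
import numpy as np
from scipy import stats

def item_deviations(pretest, posttest):
    """Absolute item-level deviation between dictated (pretest) and recalled
    (posttest) answers; both arrays have shape (n_subjects, 10), options 1-7."""
    return np.abs(np.asarray(posttest) - np.asarray(pretest)).ravel()

def compare_groups(dev_a, dev_b):
    """Ordinal comparison of the two groups' deviation distributions,
    analogous to the Mann-Whitney U test reported in Table 4."""
    u, p = stats.mannwhitneyu(dev_a, dev_b, alternative="two-sided")
    return u, p
```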
We found a significant difference between the standard SUS and the duplicate in the retest results, specifically in the deviation distance from the original answers. The two groups of experimental data (pretest and posttest) were also subjected to paired-sample t-tests. The differences in Group A were not significant (t = −1.230, P = 0.220), and the Pearson correlation coefficient was r = 0.504, P < 0.001, which indicates that memory effects were significant. The differences in Group B were significant (t = −3.883, P < 0.001), and the Pearson correlation coefficient was r = −0.016, P = 0.793, suggesting that memory effects were not significant. In other words, the duplicate-copy questionnaire used in this study significantly reduced memory effects. 4. Experiment II 4.1. Subjects Forty-four subjects participated in the experiment, none of whom had previously used the testing software. The subjects were 19–28 years of age, 16 male and 28 female, with no color-blindness. The subjects came from the industrial design department, had more than 2 years of graphic image processing experience and self-reported having filled out similar questionnaires (UMUX, USE, SUS) for UX or usability testing (objective measures or subjective evaluations) in the past. In the SUS questionnaire, we found that Item 6 was somewhat ambiguously worded in the original Chinese, and some researchers also consider Item 8 difficult to understand (Bangor et al., 2009; Finstad, 2006). We explained the precise meaning of each item and the 7-point scale to each subject to resolve these issues. The subjects self-reported that they were able to understand the 7-point rating scale and the meaning of each SUS item. They were randomly divided into two groups of 22 each: one group was assigned to complete Task (i), the other Task (ii). The subjects were given a gift in exchange for their participation. 4.2. Experimental material Our experimental materials were the Premiere CS6 and iJianJi video editing programs deployed on the Windows operating system. The task interfaces are shown in Fig. 4. The basic functions of the two programs are the same: they can be used to complete a segment of video and audio editing, add special effects, render output, change formats and so on. As per the official descriptions of the two programs, users of Premiere CS6 are generally video editing enthusiasts and professionals, while iJianJi targets novice users who need no prior knowledge of video editing. The experimental task encompassed a landscape video clip edit, color matching and output.
Task (i) was conducted with Premiere CS6 and Task (ii) with iJianJi. The tasks target the same properties; only the usability levels differ. For this experiment, we chose six usability experts able to use both programs skillfully (over 2 years of experience), who had already completed the SUS perceived usability test in advance (average scores of 37.5 and 65.8, respectively). According to the adjective ratings of SUS scores (Bangor et al., 2009), the experts' scores suggest that Premiere CS6 and iJianJi are close to 'Poor' and 'Good', respectively. The software interfaces for the tasks are shown in Fig. 4. The task process ran as follows: Step 1: Import the video file. Step 2: Drag the video file into the edit track. Step 3: Check the file length and split it; delete the excess parts. Step 4: Adjust special effects, including resizing and brightness. Step 5: Set the render format and output the video. 4.3. Experimental design We applied a randomized two-group mixed design in this experiment, with a between-subjects variable (task usability level) and a within-subject variable (time) encompassing five measurement-point levels. 4.4. Measurement points As discussed above, we divided the experiment into two tasks according to two different levels of usability: the lower is Task (i) and the higher is Task (ii). We set five measurement points non-linearly. Following the time settings typical of micro-perspective and meso-perspective granularity and the measurement points used in retrospective evaluation (DRM), we set the points as follows: T1: measurement at the end of the task; T2: 5 min after T1; T3: 1 h after T1; T4: 24 h after T1; and T5: 7 days after T1. T1 is the standard procedure for measuring perceived usability, which begins immediately upon completion of the task. Because emotions can change dramatically in a matter of minutes (Schäfer et al., 2014; Torres-Eliard et al., 2011), we set T2 at 5 min. The T3, T4 and T5 points were set by reference to several previous longitudinal studies. We used different duplicate copies of the SUS at the different measurement points. We asked subjects to use the system only once, at T1, in order to observe changes in their retrospective evaluations. The questionnaires were numbered Q1, Q2, Q3, Q4 and Q5. Q1 is the standard SUS questionnaire, and Q2, Q3, Q4 and Q5 are the duplicate copies. The four copies were all modified according to the rules described in Section 3, and each of them was different. To reduce carryover effects (Bordens and Abbott, 2008), we put the questionnaires in a counterbalanced arrangement: the four copies were administered sequentially in Latin square order (see the sketch below).
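As a small illustration of this counterbalancing step, the following sketch builds a simple cyclic Latin square over the four duplicate labels; the labels, the cyclic construction and the rotation over subjects are our own assumptions, since the paper does not specify which Latin square was used.

```python
def latin_square_orders(forms):
    """Cyclic Latin square over questionnaire forms: each row is one
    administration order, and each form appears exactly once per position."""
    n = len(forms)
    return [[forms[(row + col) % n] for col in range(n)] for row in range(n)]

# Example: assign orders of the four duplicates (labels assumed) to subjects in rotation.
orders = latin_square_orders(["Q2", "Q3", "Q4", "Q5"])
for subject_id in range(8):
    print(subject_id + 1, orders[subject_id % len(orders)])
```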
4.5. Experimental process The experiment was carried out in the usability engineering laboratory of the Industrial Design Department, HUST. We began by explaining the purpose of the task and providing instructions for filling out the SUS questionnaire. To prevent the subjects from being excessively concerned about their answers at T1, we carefully explained that they would provide more than one answer after T2. Subjects were asked to complete the questionnaires for the T1, T2 and T3 measurement points in the laboratory. The specific procedure was as follows. Step 1: Subjects were asked to watch a 3-min video containing a brief introduction to the task (for detailed steps, see Section 4.2), an introduction to the video material, document attributes and location, and some basic terms and rules of operation. Software experts participated as technical advisers in creating this explanatory video. Step 2: After the task was complete, the subjects completed the questionnaires at the T1–T3 measurement points. In the intervals between measurement points, the subjects were free to rest.
Table 5. Cronbach's alpha coefficients of the questionnaires at the five measurement points.
Measurement point   Task (i)   Task (ii)
T1                  0.856      0.836
T2                  0.854      0.810
T3                  0.889      0.850
T4                  0.861      0.810
T5                  0.860      0.814
Step 3: After the subjects completed the T1–T3 measurements, they were asked to participate in the T4 and T5 measurements in the laboratory. If a subject was absent for any reason, he or she completed the measurement by email within the same time frame as the in-laboratory counterparts.
Table 6. Test results (SUS scores) at the five measurement points of perceived usability.
No.   Task (i): T1   T2      T3      T4      T5      |  Task (ii): T1   T2      T3      T4      T5
1     48.34          46.68   45.01   40.01   48.34   |  70.01           66.68   68.35   71.68   75.02
2     46.68          40.01   38.34   38.34   45.01   |  66.68           75.02   70.01   58.35   71.68
3     31.67          20.00   23.34   23.34   25.01   |  76.68           78.35   83.35   68.35   78.35
4     31.67          25.01   25.01   25.01   26.67   |  55.01           63.35   65.01   58.35   56.68
5     30.01          21.67   28.34   23.34   43.34   |  73.35           68.35   68.35   63.35   65.01
6     35.01          38.34   45.01   40.01   45.01   |  53.34           50.01   48.34   51.68   45.01
7     40.01          31.67   38.34   38.34   35.01   |  51.68           56.68   60.00   56.68   51.68
8     38.34          33.34   51.68   55.01   53.34   |  40.01           43.34   43.34   45.01   48.34
9     55.01          48.34   50.01   45.01   46.68   |  71.68           76.68   75.02   68.35   70.01
10    63.35          53.34   56.68   51.68   51.68   |  75.02           71.68   73.35   68.35   66.68
11    55.01          51.68   53.34   46.68   55.01   |  55.01           66.68   68.35   65.01   58.35
12    25.01          28.34   28.34   26.67   33.34   |  46.68           51.68   50.01   48.34   46.68
13    20.00          8.34    10.00   13.34   6.67    |  65.01           73.35   75.02   68.35   66.68
14    36.67          28.34   18.34   16.67   65.01   |  68.35           75.02   75.02   60.01   63.35
15    55.01          48.34   46.68   50.01   43.34   |  48.34           53.34   58.35   48.34   55.01
16    38.34          28.34   35.01   35.01   35.01   |  76.68           71.68   73.35   71.68   68.35
17    20.00          16.67   15.00   15.00   20.00   |  65.01           73.35   75.02   66.68   66.68
18    45.01          30.01   35.01   33.34   43.34   |  63.35           63.35   58.35   53.34   55.01
19    36.67          15.00   13.34   23.34   46.68   |  63.35           66.68   65.01   58.35   63.35
20    30.01          31.67   28.34   33.34   31.67   |  78.35           80.02   78.35   73.35   66.68
21    25.01          20.00   18.34   35.01   46.68   |  66.68           60.01   56.68   56.68   53.34
22    48.34          45.01   51.68   55.01   55.01   |  58.35           63.35   68.35   53.34   56.68
4.6. Results
All subjects completed the two evaluations in a timely manner. The evaluation methods included a paper version (in-laboratory) and an electronic version (email). Each subject completed five questionnaires, one per measurement point, for a total of 220 questionnaires across the two tasks. Using the SUS measure of perceived usability, we calculated mean scores of 38.8 for Task (i) and 63.1 for Task (ii). These task usability levels (Bangor et al., 2009) are consistent with the inherent usability ratings given by the experts, i.e. they approach `Poor’ and `Good’, respectively.
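For illustration, the scoring step behind these means can be sketched as follows. The Python snippet below is our own illustration rather than the authors' code; the function name and the example responses are assumptions. It follows the standard SUS scoring rule, generalized to the 7-point scale used in this study: odd-numbered (positive-tone) items contribute (response − 1), even-numbered items contribute (scale maximum − response), and the summed contributions are rescaled to 0–100. Each value in Table 6 is such a per-questionnaire score, and the task means reported above average these scores over subjects.

def sus_score(responses, scale_max=7):
    # Compute a 0-100 SUS score from the 10 item responses of one questionnaire.
    # `responses` holds 10 integers in [1, scale_max]; scale_max=5 gives the
    # standard SUS scoring, scale_max=7 matches the 7-point scale used here.
    if len(responses) != 10:
        raise ValueError("SUS has exactly 10 items")
    total = 0
    for i, r in enumerate(responses):  # i is 0-based, so even i = odd-numbered item
        total += (r - 1) if i % 2 == 0 else (scale_max - r)
    return total * 100.0 / (10 * (scale_max - 1))

# Hypothetical response pattern on the 7-point variant (not taken from the study data).
print(round(sus_score([5, 3, 6, 2, 5, 3, 6, 2, 5, 3]), 2))  # -> 73.33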
A two-way ANOVA indicated a significant effect of the between-subject variable (task usability level) (|$F=59.887, P<0.01$|). Multiple comparisons (Tukey HSD) indicated no difference between the measurement points (|${\mathrm{T}}_1$| to |${\mathrm{T}}_5$|) (|$P>0.05$|). One-way ANOVA results indicated that the main effect of the within-subject variable (time) was not significant: for Task (i), |$F=1.646, P=0.168$|, and Tukey HSD comparisons indicated no difference between the measurement points (|$P>0.05$|); for Task (ii), |$F=1.508, P=0.205$|, and Tukey HSD comparisons likewise indicated no difference between the measurement points (|$P>0.05$|). Because the questionnaire contains duplicate copies of the SUS, we confirmed its reliability via the Cronbach’s Alpha coefficients of the questionnaires at the five measurement points (Table 5). The minimum acceptable Cronbach’s Alpha coefficient is generally considered to be 0.7. Although the values we measured are not as high as the 0.92 reported for the English version (Bangor et al., 2008), they are broadly in line with various localized translations of the SUS (Blazica and Lewis, 2015; AlGhannam et al., 2017; Borkowska and Jach, 2017). The results are acceptable considering that a 7-point scale was used instead of the 5-point scale. The results of the experiment are provided in Table 6. The average line chart is shown in Fig. 5. The measurement points are not equidistant, so the values shown in this figure are also not equally spaced. The scores were compared by means of paired sample t-tests; the results are shown in Table 7.
FIGURE 5. Average line chart of results at the five measurement points. The error bars represent the SD.
Table 7. Paired sample t-tests for adjacent measurement points |${\mathrm{T}}_1$|–|${\mathrm{T}}_5$|.
             Task (i)                                  Task (ii)
Pairs    Pearson’s coefficient   t        Sig.     Pearson’s coefficient   t        Sig.
T1–T2    0.892                   5.323    0.000    0.875                   −2.444   0.023
T2–T3    0.831                   −1.725   0.099    0.96                    −0.611   0.548
T3–T4    0.941                   −0.326   0.747    0.852                   4.839    0.000
T4–T5    0.67                    −2.452   0.023    0.826                   −0.608   0.550
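As a rough sketch of how these reliability and comparison statistics can be computed, the following Python snippet (our own illustration; the helper name and the synthetic placeholder data are assumptions, not the study data) derives Cronbach's Alpha from a subject-by-item matrix and runs a paired t-test together with Pearson's coefficient for two adjacent measurement points, mirroring the columns of Table 7. The ANOVA and Tukey HSD comparisons above could be run analogously with scipy.stats.f_oneway and scipy.stats.tukey_hsd (in recent SciPy versions) or with a dedicated repeated-measures routine.

import numpy as np
from scipy import stats

def cronbach_alpha(items):
    # Cronbach's Alpha for an (n_subjects x n_items) array of item scores.
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_var = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_var / total_var)

# Synthetic placeholder data: 22 subjects x 10 SUS items on a 7-point scale.
# (Uncorrelated random items, so alpha will be low here; real data would show
# the higher internal consistency reported in Table 5.)
rng = np.random.default_rng(0)
items_t1 = rng.integers(1, 8, size=(22, 10))
print("Cronbach's alpha:", round(cronbach_alpha(items_t1), 3))

# Paired comparison of per-subject SUS scores at two adjacent measurement points.
sus_t1 = rng.uniform(20, 65, size=22)             # placeholder scores at T1
sus_t2 = sus_t1 + rng.normal(-3.0, 5.0, size=22)  # placeholder scores at T2
r, _ = stats.pearsonr(sus_t1, sus_t2)
t, p = stats.ttest_rel(sus_t1, sus_t2)
print(f"r = {r:.3f}, t = {t:.3f}, Sig. = {p:.3f}")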
As shown in Table 7, there were significant differences between |${\mathrm{T}}_1$| and |${\mathrm{T}}_2$| on both tasks. On Task (i), the |${\mathrm{T}}_4$|–|${\mathrm{T}}_5$| scores also differed significantly, while |${\mathrm{T}}_2$|–|${\mathrm{T}}_3$| and |${\mathrm{T}}_3$|–|${\mathrm{T}}_4$| did not. On Task (ii), |${\mathrm{T}}_3$|–|${\mathrm{T}}_4$| differed significantly, while |${\mathrm{T}}_2$|–|${\mathrm{T}}_3$| and |${\mathrm{T}}_4$|–|${\mathrm{T}}_5$| did not. Figure 6 shows the changes in each item at the different measurement points.
FIGURE 6. Mean changes in items across the five measurement points; the 10 series represent the SUS items. Left: Task (i); right: Task (ii). The vertical axis represents item scores; the horizontal axis represents the five measurement points.
5. Discussion
The impact of time on usability evaluation is evident in both longitudinal studies and retrospective evaluations. Many previous scholars have studied such `change’ and the methods for measuring it, but few have explored how longitudinal changes affect retrospective evaluations. We designed an experiment specifically to explore this phenomenon. Multiple-observation effects are often neglected in retrospective studies. In this study, because the intervals between measurement points were relatively short, we first designed a method for retesting perceived usability based on duplicate SUS copies. Experiment I shows that the proposed method significantly reduces memory effects. Experiment II included a novel perceived usability retest, which we used to investigate longitudinal changes in perceived usability under different usability levels. In other words, we focused on our subjects’ `impression changes’ of the system as expressed over even a very short time granularity (5 min). This change provides novel evidence for improving perceived usability evaluation methods, and our results may serve as a reference for improving the accuracy of such methods. We used the results to draw two variation curves for perceived usability under different usability levels. The results of the experiment are significant with regard to measurement time settings and the accuracy of perceived usability evaluations. As discussed above, previous SUS questionnaire and perceived usability researchers have tended to neglect temporal factors in their investigations (Sauro and Lewis, 2012; Assila et al., 2016). Some scholars do provide time limits within which subjects must complete a questionnaire (Sonderegger et al., 2012), and our results confirm that perceived usability evaluation scores can differ significantly within such limits.
At the immediate evaluation (|${\mathrm{T}}_1$|), the average scores of the two tasks were 38.8 and 63.1. Per the grade definitions given by Bangor et al. (2009), the acceptability ranges of Task (i) and Task (ii) are `not acceptable’ and `marginally acceptable’, corresponding to grade scales F and D (i.e. `Poor’ and `Good’), respectively. Task (ii) scores are relatively high but still significantly lower than the overall averages of 70.5 and 70.14 reported by Bangor et al. (2008, 2009). According to the official descriptions of the two software programs, the main users of Premiere CS6 are professionals, while iJianji users are novices and intermediate users. We did not expect usability ratings as low as F and D, but the results are in fact similar to those reported by Kortum and Bangor (2013) in an evaluation of Excel. Many users simply lack the knowledge necessary to navigate complex, multifunctional software packages, so their evaluations can be considered inaccurate: more powerful software tends to receive less favorable usability assessments. Ratings may also be inflated when users center their evaluations on features with which they are already familiar and which they can use confidently. The software in Task (i) has more functions, most of which the subjects were unlikely to know how to use, so its rating may have been underestimated. Of course, usability evaluation scores will increase with continued usage (Macdorman et al., 2011). We also found that gaps between memory and experience (Miron-Shatz et al., 2009; Bruun and Ahm, 2015) do exist, for both pleasant and unpleasant experiences, and that such gaps may form even within a very short period of time. At three measurement points (|${\mathrm{T}}_1$|, |${\mathrm{T}}_2$|, |${\mathrm{T}}_3$|), the results of the two tasks differed significantly. Surprisingly, paired sample t-test results revealed significant differences from |${\mathrm{T}}_1$| to |${\mathrm{T}}_2$| (an interval of only 5 min) for both Task (i) (|$t=5.323, P<0.05$|) and Task (ii) (|$t=-2.444, P<0.05$|). From |${\mathrm{T}}_1$| to |${\mathrm{T}}_2$|, the perceived usability scores in Task (i) and Task (ii) were each strongly positively correlated (|$r>0.8$|). Minge (2008) similarly found that within 15 min, perceived usability scores for a high-attractiveness, low-usability interface decreased, while those for low-attractiveness, high-usability interfaces increased; however, that methodology was a pretest–posttest series rather than a retrospective evaluation. Our results thus stand contrary to many previously published perceived usability test results. We suggest that when the experimenter does not strictly set temporal parameters for the test, the results do not accurately reflect the subjects’ perceived usability. For example, if the experimenter spends even a few minutes distributing questionnaire forms or debugging the testing environment during the allotted test-taking time, the results of the experiment will differ considerably from those of a test that has gone `smoothly’ within the allotted time. We also found that the direction of change differs across usability levels. In previous studies, there was no strict control over test time during post-study usability measurements. In almost all studies, subjects provide measurements after the task is completed without being given any explicit temporal parameters for doing so (though some researchers do provide a time limit for completing the questionnaire).
Our results suggest that it is necessary to set strict time restrictions on perceived usability evaluation. Evaluation time should be restricted in two respects. First, the time between task completion and measurement must be considered: our results indicate that perceived usability testing must start immediately after the task is completed and should not include any experimental preparation time. Second, there must be strict time limits for completing the usability test itself; for example, the experimenter can give a reference duration, based on the length of the questionnaire, within which the test must be completed. Memories of certain emotions are rebuilt over time. Caetano et al. (2012), Nagel et al. (2007) and Kaplan et al. (2016) have independently suggested that emotion production, arousal and attenuation can change within several minutes, but that the memory of such emotion is affected by the passage of time. With time, individuals recall less intense emotions from their past, be they pleasant or unpleasant (Sheldon and Lyubomirsky, 2012; Kaplan et al., 2016). Our results do appear to be consistent with those of previous longitudinal studies and retrospective research. Perceived usability scores were attenuated between the |${\mathrm{T}}_3$| and |${\mathrm{T}}_5$| measurement points: lower perceived usability scores were elevated, whereas higher perceived usability scores decreased and subsequently remained stable. Surprisingly, however, we also found that the attenuation of lower perceived usability was slower than that of higher perceived usability. At the |${\mathrm{T}}_3$|–|${\mathrm{T}}_4$| points, Task (i) showed no significant change (|$t=-0.326, P=0.747$|), while Task (ii) did (|$t=4.839, P<0.001$|). The memory–experience gap may be larger for negative emotions than for positive stimuli due to the influence of time. Negative effects are given more attention by the subjects (Miron-Shatz et al., 2009), which may widen the gap between retrospective and concurrent scores (Miron-Shatz et al., 2009; Bruun and Ahm, 2015); negative emotional experiences also tend to last longer and fade more slowly. On Task (i), at |${\mathrm{T}}_2$|, the subjects shifted toward lower scores even within the very short period of time allotted for the questionnaire, and these lower scores persisted for longer. There were no significant differences at |${\mathrm{T}}_4$|. At |${\mathrm{T}}_5$|, the evaluation of perceived usability increased significantly (|$t=2.309, P=0.031$|). However, there was no significant difference between points |${\mathrm{T}}_5$| and |${\mathrm{T}}_1$|. The gap between |${\mathrm{T}}_1$| and |${\mathrm{T}}_2$| can be considered from the perspective of learning. As time goes on, learning gradually reduces usability problems: poor usability is no longer a negative experience once users have learned to use the product. In our experiment, however, such learning did not occur because the subjects did not continue to use the product. On Task (ii), at |${\mathrm{T}}_2$|, the subjects tended to give higher scores just as quickly, but only up to |${\mathrm{T}}_3$|; a significant difference occurred at |${\mathrm{T}}_4$| (|$t=4.839, P<0.001$|), but there was no significant difference between |${\mathrm{T}}_4$| and |${\mathrm{T}}_5$| (|$t=-0.608, P=0.550$|). This indicates that perceived usability grew stable after the |${\mathrm{T}}_4$| measurement point.
We found no significant differences between |${\mathrm{T}}_5$| and |${\mathrm{T}}_1$| on either task, and the main effect of time was not significant for the high-usability or the low-usability task. Similar results were obtained by Gao and Kortum (2017), who did not strictly define measurement points for retrospective evaluation but instead conducted a cross-sectional study with two groups of people. This may be attributable to the longitudinal study time frame. These results may also support the view of Kortum and Bangor (2013) that retrospective evaluations after long-term use are more realistic. The SUS questionnaire results show that after a long period of time (one week), the user’s perceived usability retest may still be consistent with the initial evaluation (the |${\mathrm{T}}_1$| point). This also suggests that, during post-study perceived usability measurement, if the experimenter has not set the measurement time strictly after the task, the user will provide roughly the same SUS score regardless of the inherent usability of the task. There may be a specific time reference for the accurate evaluation of perceived usability on tasks of different inherent usability; in other words, there may be an alternative to multiple measurement points for experimenters. Conversely, if a given measurement time is located on the |${\mathrm{T}}_1$|–|${\mathrm{T}}_5$| change curve obtained in our study, it would be possible to examine whether the perceived usability scores are subject to bias. The measurements in our study were gathered over short intervals. If a long-term usability test session falls after |${\mathrm{T}}_5$|, any higher or lower perceived usability evaluations may approach middle scores over time as per the principle of emotional decay. Negative emotions are easier to remember and to overestimate (Bruun and Ahm, 2015), so we speculate that evaluation attenuation did not end after |${\mathrm{T}}_5$|, especially on Task (i). Our results preliminarily reflect this conjecture, as there was a significant difference between |${\mathrm{T}}_4$| and |${\mathrm{T}}_5$|. This also means that the retrospective evaluation of unpleasant experiences may fade over time. However, this may not take long; even with continuous use of a product, reviews are likely to stabilize within 1 month (Karapanos et al., 2010). In Experiment II, each group completed only one task because this made it convenient to observe changes in the usability evaluation of a single task over time. In practice, experimental subjects often complete a task (or multiple tasks) several times in a single session, which may result in mixed perceived usability (i.e. partially positive and partially negative evaluations). Mixed emotions are less accurate and less powerful than unipolar emotions (Aaker et al., 2008), so multi-task retrospective evaluations may be less accurate than single-task evaluations. Positive emotional situations can reduce physiological arousal and thereby alter negative emotions (Davydov et al., 2011; Braniecka et al., 2014). If both positive and negative emotions arise during assessment, the user’s negative evaluations may grow more positive over time, resulting in higher scores than in a single-task experiment. If subjects complete only unipolar evaluations, then, because people tend to exaggerate their past feelings about both positive and negative events (Levine et al., 2001), we speculate that unipolar evaluations should be stronger than mixed ones.
However, if the number of practice trials increases gradually, the usability evaluation will also change significantly, as learning minimizes usability problems. In terms of the composition of the scores, the changes in the SUS items were not entirely consistent: some items increased while others decreased or remained unchanged. Changes in different evaluation dimensions do tend to differ after a long period of experience with the test object, but the overall experience values given by the user may be approximately the same. This is not surprising, as the influence of time may be exerted across many dimensions (Kjeldskov et al., 2010); one aspect may increase while other evaluation dimensions are reduced, so the overall usability evaluation may remain unchanged (Mendoza and Novick, 2005). In Task (i), Items 1, 2, 5, 6 and 8 had much lower scores at |${\mathrm{T}}_2$| compared to |${\mathrm{T}}_1$|, while Items 9 and 10 were higher. According to the two evaluation dimensions of the SUS (Borsci et al., 2009; Lewis and Sauro, 2009), these item changes represent the usability dimension. The values of nearly all items on Task (ii) increased over the course of the experiment; changes in the usability dimension were concentrated in Items 5 and 9, and the learnability dimension (Items 7 and 10) showed no significant changes. Our results indicate that user confidence changes dramatically over time. At the longer-term measurement points (|${\mathrm{T}}_3$|–|${\mathrm{T}}_5$|), our results were similar to previously published long-term UX results (Kjeldskov et al., 2010; Karapanos et al., 2010; Sonderegger et al., 2012): over the entire UX evaluation process, strong usability ratings gradually weakened over time and weak usability ratings grew stronger. On difficult tasks, satisfaction and confidence increased over time while depression and fatigue declined. Within the shorter time period (|${\mathrm{T}}_2$|), scores on the negative-tone items (Items 2, 6, 8) decreased. This indicates that the user’s frustration with a low-usability task can spike within a short period of time. There are many types of standardized questionnaires available for perceived usability evaluation. The SUS questionnaire contains only 10 items, which allows subjects to report their perceived usability relatively quickly. Consider an experimenter who does not use the SUS and instead uses the USE or SUMI questionnaires, which contain dozens of items and would take the user more than 5 min to complete. Per the results of this study, the evaluation of perceived usability may change over the time it takes to complete such a questionnaire. In tasks with high inherent usability, the scores of the first few items in the sequence may be lower than those of the last items. A similar order effect may arise in tasks with lower inherent usability, because the process of emotional attenuation is not linear but instead resembles an exponential curve; item scores may undergo non-linear, continuous changes with the order of presentation. Now, consider our experimental procedure. At the end of the |${\mathrm{T}}_1$| point, there appeared to be a sudden change in emotional involvement; the user’s emotional arousal may have been reactivated at the |${\mathrm{T}}_2$| point. This raises the question of whether emotional arousal and valence are the cause of the perceived usability differences between Q1 and Q2. Seo et al. (2015) found that perceived usability is positively correlated with emotional valence, but that there is no relationship between perceived usability and emotional arousal.
However, as time goes on, emotional arousal and valence decrease as the emotion decays. Our results suggest that such perceived usability evaluations (on a long questionnaire) may be inaccurate: because the user’s perception can change very quickly, there may be error between the first and last items. We plan to explore these changes further in a future study.
6. Conclusion and Limitations
The results presented here represent a novel approach to eliminating multiple-observation effects in longitudinal assessments. We conducted a longitudinal study on the retrospective assessment of perceived usability and accounted for time granularity, which previous researchers have tended to neglect. The proposed method retains the advantages of the SUS questionnaire (timeliness, low cost and high reliability) while minimizing memory effects in the retests. The alternate-form reliability test and the item memory effect test showed that the retest method based on duplicate SUS questionnaires nearly eliminates memory effects and has high reliability. In our longitudinal study on retrospective evaluation, we found a memory–concurrent evaluation gap in perceived usability. It is particularly noteworthy that even a very short period of time (5 min) is enough to form a difference. At different levels of usability, the direction of the evaluation change and its duration differ. We believe that temporal factors must be properly accounted for in the perceived usability evaluation process to obtain accurate evaluation results. These temporal factors should include a strict limit on when the assessment starts and a restricted overall test time. Ideally, the test process should be completed quickly. This study was not without limitations. (i) We designed the measurement points in a non-linear arrangement in order to observe longitudinal changes in perceived usability within a relatively short time period. Five measurement points in total were placed within the observation time, and only two points related to micro-perspective granularity. A greater number of measurement points at micro-perspective granularity may yield more accurate usability change curves, which we plan to explore in a future study. (ii) We used `Good’ and `Poor’ usability levels. Perceived usability within a short period of time (|${\mathrm{T}}_2$|) changed at both of these levels, but the size and direction of the changes were not the same. There are five levels of perceived usability defined in the literature; in the future, we plan to experiment with the other levels to further explore the various directions of these changes. (iii) In Experiment II, counterbalancing can be used to flatten carryover effects between treatments. Counterbalancing is a frequently used retest approach: it requires subjects to carry out the treatments in different orders so that transfer effects are distributed across them (Bordens and Abbott, 2008). The Latin squares we used ensure an appropriate order of presentation. Counterbalancing distributes transfer effects evenly among the treatments so that there is no difference between the group means (Bordens and Abbott, 2008), but it may not eliminate the retest effect within a single subject, as the same subject must be measured many times. We should try to further reduce the retest effect in the new duplicate. However, because the SUS and even shorter questionnaires are used, it is difficult to eliminate these effects completely.
One strategy is to reduce the memory effect of the questionnaires themselves; we are currently designing and testing a new SUS duplicate. An additional design strategy is to modify the tone of all items with reference to the positive version (Sauro and Lewis, 2011), meaning that the tone of the new copy would be the opposite of the original SUS. In addition to the advantages of the method provided here, the familiarity of the new questionnaire can be further reduced, thereby reducing the retest effect. We encourage practitioners to identify and eliminate multiple-observation effects when investigating other retest methods, and we encourage other researchers to set multiple measurement points in any retrospective measurement of perceived usability and to observe the size and direction of perceived usability changes (especially at micro-perspective granularity).
ACKNOWLEDGEMENTS
This research is supported by the Department of Industrial Design, and the experiment was carried out in the usability engineering laboratory of the Industrial Design Department, HUST.
REFERENCES
La, A. and Rathod, S. (2005) Questionnaire Length & Fatigue Effects. Bloomerce White Paper, 5.
Aaker, J. L., Drolet, A. and Griffin, D. W. (2008). Recalling mixed emotions: How did I feel again? Social Science Electronic Publishing, 35, 268–278.
Albert, W. and Tullis, T. (2016) Measuring the User Experience: Collecting, Analyzing, and Presenting Usability Metrics, 2nd edn. Beijing: Publishing House of Electronics Industry.
AlGhannam, B. A., Albustan, S. A., Alhassan, A. A. and Albustan, L. A. (2017) Towards a standard Arabic system usability scale (A-SUS): Psychometric evaluation using communication disorder app. Int. J. Hum. Comput. Interact., 3, 1–6.
Arendasy, M. E. and Sommer, M. (2013). Quantitative differences in retest effects across different methods used to construct alternate test forms. Intelligence, 41, 181–192.
Arendasy, M. E. and Sommer, M. (2017) Reducing the effect size of the retest effect: Examining different approaches. Intelligence, 62, 89–98.
Assila, A., De, K. M. and Ezzedine, H. (2016) Standardized usability questionnaires: Features and quality focus. Electronic Journal of Computer Science & Information Technology, 6, 15–31.
Bangor, A., Kortum, P. T. and Miller, J. T. (2008). An empirical evaluation of the system usability scale. Int. J. Hum. Comput. Interact., 24, 574–594.
Bangor, A., Kortum, P. and Miller, J. (2009). Determining what individual SUS scores mean: Adding an adjective rating scale. Journal of Usability Studies, 4, 114–123.
Beglinger, L. J., Gaydos, B., Tangphaodaniels, O., Duff, K., Kareken, D. A., Crawford, J., et al. (2005). Practice effects and the use of alternate forms in serial neuropsychological testing. Archives of Clinical Neuropsychology, 20, 517–529.
Benedict, R. H. and Zgaljardic, D. J. (1998). Practice effects during repeated administrations of memory tests with and without alternate forms. Journal of Clinical & Experimental Neuropsychology, 20, 339–352.
Berkman, M. I. and Karahoca, D. (2016). Re-assessing the usability metric for user experience (UMUX) scale.
Journal of Usability Studies, 11, 89–109.
Berntsen, D. (1996) Involuntary autobiographical memory. Applied Cognitive Psychology, 10, 455–460.
Blazica, B. and Lewis, J. R. (2015). A Slovene translation of the system usability scale: The SUS-SI. Int. J. Hum. Comput. Interact., 31, 112–117.
Bordens, K. S. and Abbott, B. B. (2008) Research Design and Methods, 6th edn. Shanghai: Shanghai People's Publishing House.
Borkowska, A. and Jach, K. (2017) Pre-testing of Polish translation of System Usability Scale (SUS). In Information Systems Architecture and Technology: Proceedings of 37th International Conference on Information Systems Architecture and Technology – ISAT 2016 – Part I, pp. 143–153. New York: Springer International Publishing.
Borsci, S., Federici, S. and Lauriola, M. (2009). On the dimensionality of the system usability scale: A test of alternative measurement models. Cogn. Process., 10, 193–197.
Borsci, S., Federici, S., Bacci, S., Gnaldi, M. and Bartolucci, F. (2015). Assessing user satisfaction in the era of user experience: Comparison of the SUS, UMUX, and UMUX-LITE as a function of product experience. Int. J. Hum. Comput. Interact., 31, 484–495.
Braniecka, A., Trzebińska, E., Dowgiert, A. and Wytykowska, A. (2014) Mixed emotions and coping: The benefits of secondary emotions. PLoS One, 9, e103940.
Brooke, J. (1996) SUS: A quick and dirty usability scale. Usability Evaluation in Industry, 189.
Brooke, J. (2013). SUS: A retrospective. Usability Professionals' Association, 8, 29–40.
Bruun, A. and Ahm, S. (2015) Mind the gap! Comparing retrospective and concurrent ratings of emotion in user experience evaluation. In 15th IFIP TC13 Conference on Human-Computer Interaction (INTERACT), vol. 9296, pp. 237–254. Heidelberg: Springer International Publishing.
Caetano, M., Mouchtaris, A. and Wiering, F. (2012) The role of time in music emotion recognition. In International Symposium, CMMR, vol. 2, pp. 498–500.
Chin, J. P., Diehl, V. A. and Norman, K. L. (1988) Development of an instrument measuring user satisfaction of the human-computer interface. In SIGCHI Conference on Human Factors in Computing Systems, vol. 85, pp. 213–218. New York: ACM.
Chittaro, L. and Vianello, A. (2016) Evaluation of a mobile mindfulness app distributed through on-line stores. Int. J. Hum. Comput. Stud., 86, 63–80.
Davydov, D. M., Zech, E. and Luminet, O. (2011). Affective context of sadness and physiological response patterns. Journal of Psychophysiology, 25, 67–80.
Dawes, J. (2012). Do data characteristics change according to the number of scale points used? An experiment using 5-point, 7-point and 10-point scales. International Journal of Market Research, 50, 61–77.
Redelmeier, D. A. and Kahneman, D. (1996).
Patients' memories of painful medical treatments: Real-time and retrospective evaluations of two minimally invasive procedures. Pain, 66, 3.
Fenko, A., Schifferstein, H. N. J. and Hekkert, P. (2009). Shifts in sensory dominance between various stages of user–product interactions. Applied Ergonomics, 41, 34–40.
Finstad, K. (2006) The system usability scale and non-native English speakers. Journal of Usability Studies, 1, 185–188.
Finstad, K. (2010a) Response interpolation and scale sensitivity: Evidence against 5-point scales. Journal of Usability Studies, 5, 104–110.
Finstad, K. (2010b). The usability metric for user experience. Interacting with Computers, 22, 323–327.
Forlizzi, J. and Battarbee, K. (2004) Understanding experience in interactive systems. In Conference on Designing Interactive Systems: Processes, Practices, Methods, and Techniques, 10, pp. 261–268. New York: ACM.
Gao, M. and Kortum, P. (2017). Measuring the usability of home healthcare devices using retrospective measures. Proceedings of the Human Factors and Ergonomics Society 2017 Annual Meeting, 61, 1281–1285.
Goldberg, T. E., Harvey, P. D., Wesnes, K. A., Snyder, P. J. and Schneider, L. S. (2015). Practice effects due to serial cognitive assessment: Implications for preclinical Alzheimer's disease randomized controlled trials. Alzheimer's & Dementia: Diagnosis, Assessment & Disease Monitoring, 1, 103–111.
Haak, M. J. V. D., Jong, M. D. T. D. and Schellens, P. J. (2003). Retrospective versus concurrent think-aloud protocols: Testing the usability of an online library catalogue. Behaviour & Information Technology, 22, 339–351.
Harbich, S. and Hassenzahl, M. (2017). User experience in the work domain: A longitudinal field study. Interacting with Computers, 29, 306–324.
Hassenzahl, M. (2006). User experience: A research agenda. Behaviour & Information Technology, 25, 91–97.
Hassenzahl, M. and Monk, A. (2010). The inference of perceived usability from beauty. Hum. Comput. Interact., 25, 235–260.
Hassenzahl, M. and Sandweg, N. (2004) From mental effort to perceived usability: Transforming experiences into summary assessments. In CHI '04 Extended Abstracts on Human Factors in Computing Systems, pp. 1283–1286. New York: ACM.
Hassenzahl, M. and Ullrich, D. (2007). To do or not to do: Differences in user experience and retrospective judgments depending on the presence or absence of instrumental goals. Interacting with Computers, 19, 429–437.
Hornbæk, K. (2006). Current practice in measuring usability: Challenges to usability studies and research. Int. J. Hum.–Comput. Stud., 64, 79–102.
ISO 9241-210:2010 Ergonomics of Human-System Interaction – Part 210: Human-Centred Design for Interactive Systems.
Jones, R. N. (2015).
Practice and retest effects in longitudinal studies of cognitive functioning. Alzheimers & Dementia, 1, 101.
Sauer, J. and Sonderegger, A. (2011). The influence of product aesthetics and user state in usability testing. Behaviour & Information Technology, 30, 787–796.
Kaplan, R. L., Van Damme, I., Levine, L. J. and Loftus, E. F. (2016). Emotion and false memory. Emotion Review, 8, 8–13.
Karapanos, E. (2008) User experience over time. In CHI '08 Extended Abstracts on Human Factors in Computing Systems, vol. 436, pp. 3561–3566. Florence, Italy: ACM.
Karapanos, E., Martens, J.-B. and Hassenzahl, M. (2012). Reconstructing experiences with iScale. International Journal of Human-Computer Studies, 70, 849–865.
Karapanos, E., Zimmerman, J., Forlizzi, J. and Martens, J.-B. (2009) User experience over time: An initial framework. In SIGCHI Conference on Human Factors in Computing Systems, pp. 729–738. New York: ACM.
Karapanos, E., Zimmerman, J., Forlizzi, J. and Martens, J.-B. (2010). Measuring the dynamics of remembered experience over time. Interacting with Computers, 22, 328–335.
Kim, H. K., Han, S. H., Park, J. and Park, W. (2015). How user experience changes over time: A case study of social network services. Human Factors & Ergonomics in Manufacturing & Service Industries, 25, 659–673.
Kirakowski, J. and Corbett, M. (1993). SUMI: The software usability measurement inventory. British Journal of Educational Technology, 24, 210–212.
Kjeldskov, J., Skov, M. B. and Stage, J. (2005) Does time heal?: A longitudinal study of usability. In Australia Conference on Computer-Human Interaction: Citizens Online: Considerations for Today and the Future, pp. 1–10. Computer-Human Interaction Special Interest Group (CHISIG) of Australia.
Kjeldskov, J., Skov, M. B. and Stage, J. (2010). A longitudinal study of usability in health care - does time heal? Stud. Health Technol. Inform., 79, 181–191.
Kortum, P. T. and Bangor, A. (2013). Usability ratings for everyday products measured with the system usability scale. International Journal of Human-Computer Interaction, 29, 67–76.
Kortum, P. and Sorber, M. (2015). Measuring the usability of mobile applications for phones and tablets. International Journal of Human-Computer Interaction, 31, 518–529.
Kujala, S. and Miron-Shatz, T. (2013) Emotions, experiences and usability in real-life mobile phone use. In SIGCHI Conference on Human Factors in Computing Systems, pp. 1061–1070. New York: ACM.
Kujala, S., Roto, V., Karapanos, E. and Sinnelä, A. (2011). UX curve: A method for evaluating long-term user experience. Interacting with Computers, 23, 473–483.
Kujala, S., Vogel, M., Pohlmeyer, A. E. and Obrist, M. (2013) Lost in time: The meaning of temporal aspects in user experience. In CHI '13 Extended Abstracts on Human Factors in Computing Systems, pp. 559–564.
Lallemand, C.,
Gronier, G. and Koenig, V. (2015). User experience: A concept without consensus? Exploring practitioners' perspectives through an international survey. Computers in Human Behavior, 43, 35–48.
Levine, L. J., Prohaska, V., Burgess, S. L., Rice, J. A. and Laulhere, T. M. (2001). Remembering past emotions: The role of current appraisals. Cognition & Emotion, 15, 393–417.
Lewis, J. R. (2006) Usability testing. In Handbook of Human Factors and Ergonomics, 3rd edn. Hoboken, NJ: John Wiley.
Lewis, J. R. (2018) Measuring perceived usability: The CSUQ, SUS, and UMUX. International Journal of Human-Computer Interaction, 3, 1–9.
Lewis, J. R., Utesch, B. S. and Maher, D. E. (2013) UMUX-LITE: When there's no time for the SUS. In SIGCHI Conference on Human Factors in Computing Systems, pp. 2099–2102. New York: ACM.
Lewis, J. R., Utesch, B. S. and Maher, D. E. (2015). Measuring perceived usability: The SUS, UMUX-LITE, and AltUsability. International Journal of Human-Computer Interaction, 31, 496–505.
Lievens, F., Reeve, C. L. and Heggestad, E. D. (2007). An examination of psychometric bias due to retesting on cognitive ability tests in selection settings. Journal of Applied Psychology, 92, 1672–82.
Lund, A. M. (2001). Measuring usability with the USE questionnaire. Usability and User Experience Newsletter, STC Usability SIG, 8, 1–4.
Macdorman, K. F., Whalen, T. J., Ho, C. C. and Patel, H. (2011). An improved usability measure based on novice and expert performance. International Journal of Human-Computer Interaction, 27, 280–302.
Mahlke, S. and Thüring, M. (2007) Studying antecedents of emotional experiences in interactive contexts. In Proceedings of the CHI Conference, pp. 915–918. New York: ACM.
Calamia, M., Markon, K. and Tranel, D. (2012). Scoring higher the second time around: Meta-analyses of practice effects in neuropsychological assessment. Clinical Neuropsychologist, 26, 543–570.
Mcdonald, S., Zhao, T. and Edwards, H. M. (2013). Dual verbal elicitation: The complementary use of concurrent and retrospective reporting within a usability test. International Journal of Human-Computer Interaction, 29, 647–660.
Mcfarland, C. and Ross, M. (1987). The relation between current impressions and memories of self and dating partners. Personality & Social Psychology Bulletin, 13, 228–238.
Mclellan, S., Muddimer, A. and Peres, S. C. (2012). The effect of experience on system usability scale ratings. Usability Professionals' Association, 7, 56–67.
Mendoza, V. and Novick, D. G. (2005) Usability over time. In Proceedings of SIGDOC 2005, pp. 151–158. New York: ACM.
Miller, J. C., Ruthig, J. C., Bradley, A. R., Wise, R. A., Pedersen, H. A. and Ellison, J. M. (2009). Learning effects in the block design task: A stimulus parameter-based approach.
Psychological Assessment, 21, 570–7.
Minge, M. (2008) Dynamics of user experience. In Proceedings of the NordiCHI '08 Conference, pp. 1–55. New York: ACM.
Miron-Shatz, T., Stone, A. and Kahneman, D. (2009). Memories of yesterday's emotions: Does the valence of experience affect the memory-experience gap? Emotion, 9, 885.
Mitchell, T. R., Thompson, L., Peterson, E. and Cronk, R. (1997). Temporal adjustments in the evaluation of events: The "rosy view". J. Exp. Soc. Psychol., 33, 421–448.
Moellendorff, M. W., Hassenzahl, M. and Platz, A. (2006) Dynamics of user experience: How the perceived quality of mobile phones changes over time. In User Experience – Towards a Unified View, Workshop at the Nordic Conference on Human-Computer Interaction, pp. 74–78.
Nagel, F., Kopiez, R., Grewe, O. and Altenmüller, E. (2007). EMuJoy: Software for continuous measurement of perceived emotions in music. Behavior Research Methods, 39, 283–290.
Norman, D. A. (2009). The way I see it: Memory is more important than actuality. Interactions, 16, 24–26.
Park, J., Han, S. H., Kim, H. K., Cho, Y. and Park, W. (2013). Developing elements of user experience for mobile phones and services: Survey, interview, and observation approaches. Human Factors in Ergonomics & Manufacturing, 23, 279–293.
Pereira, D. R., Costa, P. and Cerqueira, J. J. (2015). Repeated assessment and practice effects of the written symbol digit modalities test using a short inter-test interval. Archives of Clinical Neuropsychology, 30.
Prümper, J., Zapf, D., Brodbeck, F. C. and Frese (1992). Some surprising differences between novice and expert errors in computerized office work. Behaviour & Information Technology, 11, 319–328.
Qiu, D. H. (2001) Mathematical Sentics. Changsha, China: Hunan People's Publishing House.
Randall, J. G. and Villado, A. J. (2017). Take two: Sources and deterrents of score change in employment retesting. Human Resource Management Review, 27, 536–553.
Registermihalik, J. K., Kontos, D. L., Guskiewicz, K. M., Mihalik, J. P., Conder, R. and Shields, E. W. (2012). Age-related differences and reliability on computerized and paper-and-pencil neurocognitive assessment batteries. J. Athl. Train., 47, 297–305.
Robinson, M. D. and Clore, G. L. (2002). Episodic and semantic knowledge in emotional self-report: Evidence for two judgment processes. Journal of Personality & Social Psychology, 83, 198.
Salthouse, T. A. and Tuckerdrob, E. M. (2008). Implications of short-term retest effects for the interpretation of longitudinal change. Neuropsychology, 22, 800.
Sauer, J., Seibel, K. and Rüttinger, B. (2010). The influence of user expertise and prototype fidelity in usability tests. Applied Ergonomics, 41, 130–140.
Sauro, J. (2011) A Practical Guide to the System Usability Scale: Background, Benchmarks & Best Practices. CreateSpace Independent Publishing Platform.
Sauro, J. and Lewis, J. R. (2009) Correlations among prototypical usability metrics: Evidence for the construct of usability. In SIGCHI Conference on Human Factors in Computing Systems, pp. 1609–1618. New York: ACM.
Sauro, J. and Lewis, J. R. (2011). When designing usability questionnaires, does it hurt to be positive? In Proceedings of ACM SIGCHI, pp. 2215–2223. New York, NY, USA: ACM.
Sauro, J. and Lewis, J. R. (2012) Quantifying the User Experience: Practical Statistics for User Research. Burlington, MA: Morgan Kaufmann.
Schäfer, T., Zimmermann, D. and Sedlmeier, P. (2014). How we remember the emotional intensity of past musical experiences. Frontiers in Psychology, 5, 1–10.
Seo, K. K., Lee, S., Chung, B. D. and Park, C. (2015). Users' emotional valence, arousal, and engagement based on perceived usability and aesthetics for web sites. International Journal of Human-Computer Interaction, 31, 72–87.
Sheldon, K. M. and Lyubomirsky, S. (2012). The challenge of staying happier: Testing the hedonic adaptation prevention model. Pers. Soc. Psychol. Bull., 38, 670–680.
Sonderegger, A. and Sauer, J. (2010). The influence of design aesthetics in usability testing: Effects on user performance and perceived usability. Applied Ergonomics, 41, 403–410.
Sonderegger, A., Schmutz, S. and Sauer, J. (2016) The influence of age in usability testing. Applied Ergonomics, 52, 291.
Sonderegger, A., Zbinden, G., Uebelbacher, A. and Sauer, J. (2012). The influence of product aesthetics and usability over the course of time: A longitudinal field experiment. Ergonomics, 55, 713–730.
Talarico, J. M., Labar, K. S. and Rubin, D. C. (2004). Emotional intensity predicts autobiographical memory experience. Mem. Cognit., 32, 1118–1132.
Yang, T., Linder, J. and Bolchini, D. (2012). DEEP: Design-oriented evaluation of perceived usability. International Journal of Human-Computer Interaction, 28, 308–346.
Thielsch, M. T., Blotenberg, I. and Jaron, R. (2018). User evaluation of websites: From first impression to recommendation. Interacting with Computers, 26, 89–102.
Thomas, D. L. and Diener, E. (1990). Memory accuracy in the recall of emotions. Journal of Personality & Social Psychology, 59, 291–297.
Torres-Eliard, K., Labbé, C. and Grandjean, D. (2011). Towards a dynamic approach to the study of emotions expressed by music. In International Conference on Intelligent Technologies for Interactive Entertainment, vol. 78, pp. 252–259. Springer Berlin Heidelberg.
UNE EN ISO 9241-11-1998. Ergonomic Requirements for Office Work with Visual Display Terminals (VDTs) – Part 11: Guidance on Usability.
Vermeeren, A. P.
O. S., Law, L. C., Roto, V., Obrist, M., Hoonhout, J. and Väänänen-Vainio-Mattila, K. (2010) User experience evaluation methods: Current state and development needs. In Nordic Conference on Human-Computer Interaction, pp. 521–530. New York: ACM.
Walker, P., Hitch, G. J., Dewhurst, S. A., Whiteley, H. E. and Brandimonte, M. A. (1997). The representation of nonstructural information in visual memory: Evidence from image combination. Memory & Cognition, 25, 484–491.
Widaman, K. F., Ferrer, E. and Conger, R. D. (2010). Factorial invariance within longitudinal structural equation models: Measuring the same construct across time. Child Dev. Perspect., 4, 10–18.
© The Author(s) 2019. Published by Oxford University Press on behalf of The British Computer Society. All rights reserved.
TI - Longitudinal Study on Retrospective Assessment of Perceived Usability: A New Method and Perspectives JF - Interacting with Computers DO - 10.1093/iwc/iwz026 DA - 2019-12-31 UR - https://www.deepdyve.com/lp/oxford-university-press/longitudinal-study-on-retrospective-assessment-of-perceived-usability-B9FwOuMTmX SP - 393 VL - 31 IS - 4 DP - DeepDyve ER -