Measuring Real-Time Response in Real-Life Settings

Abstract

Real-time response (RTR) measurement is an important technique to assess human processing of information. Ever since Maier et al. verified the reliability and validity of physical RTR devices in this journal a decade ago, there has been a growing trend toward virtual measurement platforms to overcome the limitations of conventional, laboratory-based methods. We introduce the Debat-O-Meter, a novel online RTR platform for mobile devices, which allows researchers to measure how viewers perceive political debates in the setting of their private homes. We draw on a large study (N = 5,660) conducted during the 2017 German general election campaign and show that virtualized measurement indeed facilitates diverse and large N field studies while conforming to established standards of reliability and validity.

Introduction

Real-time response (RTR) measurement is an important technique to analyze the human processing of audiovisual stimuli. Its general purpose is to enable subjects who are exposed to media content, for example, movies, music, or TV/radio programs, to provide researchers with spontaneous feedback about how they perceive the stimulus at any given time. Usually, this is done with the help of small physical devices equipped with a dial knob, a slider, or buttons that subjects can operate to convey their current impression. The stream of data provided by these devices is stored in a database and can be graphed and statistically analyzed, both at the individual and at the aggregated level. Combined with surveys before and after the stimulus, RTR provides a convenient way to open up the “black box” of individual information processing (Biocca, David, & West, 1994).

Since its inception by Paul Lazarsfeld and Frank Stanton more than 80 years ago, RTR has gradually established itself in political and communication science research, mainly as a consequence of both the emergence of presidential debates in the 1960s and 1970s, which fostered an interest in mechanisms of debate perception, and the widespread availability of microcomputer technology (e.g., Boydstun, Glazier, Pietryka, & Resnik, 2014; Maurer, Reinemann, Maier, & Maier, 2007; Metz et al., 2016; Nagel, Maurer, & Reinemann, 2012; Papastefanou, 2013; Schill & Kirk, 2014; Schill, Kirk, & Jasperson, 2017). Its rise has allowed political scientists and communication researchers to gain profound insights into how viewers perceive political debates and how debates affect media recipients (for an overview, see Benoit, Hansen, & Verser, 2003; Maier, Faas, & Maier, 2014; McKinney & Carlin, 2004). However, despite tremendous growth in methodological sophistication, the design of RTR studies has essentially remained unchanged and firmly wedded to laboratory-based research. Most recently, there has been a growing trend toward virtual measurement platforms to overcome the limitations of conventional, laboratory-based methods. While this novel approach facilitates field studies, the validity and reliability of virtualized RTR data are largely unexplored. To fill this gap, we address two research questions: whether RTR field data are reliable and whether they are internally valid.

RTR Methodology: Boundaries and Benefits

Despite thorough research, several problems with the approach still persist.
Since the inception of RTR research, political communication researchers have exclusively relied on physical devices to capture their audiences’ reactions, meaning that the application scenario has remained virtually unchanged: Subjects are invited into a laboratory where they are surveyed, watch the debate together, and provide live feedback. Only recently have researchers begun to develop virtualized, Internet-based implementations of RTR technology, mainly to evade problems of external validity that plague the traditional approach. As such, virtualized RTR holds the promise of reducing the barriers to participation, thereby improving the size, diversity, and spatial coverage of studies’ samples while adding to the robustness and generalizability of findings. Moreover, since participants use their own mobile devices in natural reception situations (i.e., at home), the approach is expected to be cost-efficient, as no specialized hardware is needed any more. Being software based, the instrument is also more flexible than traditional hardware, permitting straightforward customization of the graphical user interface, measurement scaling, input mode (reset vs. latched), and the like.

So far, only two attempts have been made to take on the aforementioned challenges and potentials: Boydstun et al. (2014) used a mobile web application with push buttons to survey 3,300 U.S. students, who provided instant reactions to a presidential debate in 2012. Similarly, Maier, Hampe, and Jahn (2016) equipped a small sample of 32 students with smartphones containing a preinstalled app to evaluate a TV debate of the 2013 German general election from home, using a 7-point dial implementation. While Boydstun et al. (2014) focus on substantive results and do not explicitly report on the quality of their data, Maier, Hampe, et al. (2016) examined their results methodologically and deem their data to be reliable and valid. While their work is an important first step, it is based on a small student sample and remains the only attempt to assess data quality outside the laboratory. This article is the first to extend their work by drawing on a large, diverse set of participants from the general population to fill some of the gaps that persist.

Reliability

Despite its long research history, knowledge about the reliability of RTR measurement is still fragmented (Bachl, 2014; Papastefanou, 2013). Most importantly, “classical” approaches such as test–retest scenarios are difficult to reconcile with the notion of capturing spontaneous evaluations. Unsurprisingly, studies relying on test–retest procedures report somewhat inconsistent findings: While Fenwick and Rice (1991) or Hughes and Lennox (1990) find high test–retest correlations, Boyd and Hughes (1992) report coefficients of 0.53–0.64, which they assess as “low scores.” Studies based on split-half designs have repeatedly measured strong scores of internal consistency. Here, Hallonquist and Suchmann (1944) report intercorrelations between 0.95 and 0.99, Hallonquist and Peatman (1947) find correlations from 0.80 up to 0.94, and Schwerin (1940) reports coefficients of 0.89–0.93. In a third approach, Papastefanou (2013) recently advocated the use of Cronbach’s alpha for calculating reliability and reports scores of internal consistency surpassing 0.90 when measuring with ambulatory RTR devices.
Others have relied on parallel-test reliability (Maier, Maurer, Reinemann, & Faas, 2007; Maier, Rittberger, & Faas, 2016; Metz et al., 2016; Reinemann, Maier, Faas, & Maurer, 2005). Comparing a push-button and a slider system, Maier et al. (2007) identify a significant correlation of 0.38 for the whole media stimulus, which rises up to 0.69 for certain key sequences. In addition to these findings on physical devices, only few studies examine the reliability of virtualized RTR. Metz et al. (2016) were the first to provide insights by comparing a virtualized slider implementation with physical dials, finding an aggregated correlation of 0.77 for randomized groups in a controlled laboratory design. Their findings are supported by Maier, Hampe, et al. (2016), who report coefficients of parallel-test reliability above 0.51 when comparing two nonrandomized groups watching the same debate, one in a laboratory and one at home.

Internal Validity

Regarding internal validity, the picture is both clearer and in general more conclusive: several studies confirm that RTR data correlate with related variables in expected ways (e.g., Bachl, 2013; Maier, Hampe, et al., 2016; Maier et al., 2007; Papastefanou, 2013; Reinemann et al., 2005). For instance, real-time evaluations have been shown to be substantially affected by party identification, an individual’s stable and affective tie to a political party (e.g., Campbell, Converse, Miller, & Stokes, 1960) that is known to color the perception and evaluation of both political candidates and issues. Similarly, post-debate judgments of contenders’ performances are strongly associated with viewers’ spontaneous real-time impressions, a relation that remains in place even after controlling for party identification (Bachl, 2013; Biocca et al., 1994; Maier et al., 2007; Papastefanou, 2013; Reinemann et al., 2005). With respect to virtualized RTR measurement, Maier, Hampe, et al. (2016) verify construct validity by analyzing the relationship between real-time responses and party identification, and criterion validity by examining the correlations between RTR scores and the perceived debate winner. These correlations are confirmed by a path analysis yielding a structure well known from previous research on televised debates in Germany (Bachl, 2013; Maier et al., 2007).

External Validity

Since most knowledge from RTR research derives from laboratory settings with small groups of participants using physical devices, little is known about the method’s external validity. Uncertainty stems from the fact that research has so far mostly relied on physical devices in laboratory settings (e.g., Boydstun et al., 2014; Maier, Rittberger, et al., 2016; Papastefanou, 2013; Wagschal et al., 2017). These (artificial) situations might differ considerably from natural reception situations in private surroundings, thus compromising the external validity of RTR studies (Bachl, 2014; Papastefanou, 2013). Apart from the aim of obtaining a socially and spatially representative sample in the first place, several other problems regarding external validity (Bachl, 2014; Metz et al., 2016; Papastefanou, 2013) are not the focus of this article and are left to further research. Instead, we concentrate on the reliability and validity of virtualized RTR data, which are still largely unexplored. Filling this research gap is the main goal of this article.
Research Questions and Hypotheses

While leaving the laboratory holds the promise of accessing large and diverse audiences, it also entails a significant risk since researchers have no control over the environment in which participants watch the debate and, consequently, no means to ensure sufficient data quality. Essentially, this means that the question of reliability (and internal validity) of the incoming data has to be addressed. Hence, we raise our first research question (RQ1): Are RTR field data reliable? Regarding the reliability of survey data, McDonald’s omega and Cronbach’s alpha are considered established indicators in the literature. While previous work suggests that scores above 0.9 indicate a high degree of reliability, alpha values surpassing 0.8 are also deemed acceptable (Bortz & Döring, 2006, p. 199). Assuming that our field data are reliable, we hypothesize that the scores in our study remain above the crucial threshold of 0.8 (H1).

Moreover, another issue to assess is internal validity. Here, construct and criterion validity are two established ways to investigate whether data are internally valid (Bachl, 2013; Maier et al., 2007; Maier, Rittberger, et al., 2016). Since this is the first time that mobile RTR data from a large field study are tested from a methodological perspective, we ask (RQ2): Are RTR field data internally valid? In previous studies, construct validity has been operationalized by examining the association between RTR scores and party identification, leading us to ask whether we can also find such an association in field data (RQ2.1). Based on the friend-or-foe logic of the social identity that resides at the heart of partisanship (Green, Palmquist, & Schickler, 2004), we assume pronounced differences in participants’ average net evaluations by partisan attachment (H2) and that all partisan groups are significantly distinguishable from one another in terms of their evaluations (H3). Furthermore, validity can be investigated by analyzing the association between real-time responses and external criteria, for example, the perceived debate winner, suggesting that the data should align with other measures of debate perception (RQ2.2). Hence, we hypothesize that real-time evaluations and post-debate verdicts on the candidates’ performances correlate both significantly and substantially even after controlling for prior attitudes and other measures of debate perception (H4), thereby fulfilling the notion of criterion validity.

Data and Methods

Stimulus

The only televised debate between the two main contenders for the chancellorship in the 2017 German general election took place on the evening of September 3, 2017. Three weeks before election day, incumbent Angela Merkel of the Christian Democratic Union (CDU)/Christian Social Union (CSU) and Martin Schulz, the Social Democratic Party’s (SPD) top candidate, discussed the most important issues in a debate that lasted more than 90 min. Around 16.5 million viewers watched the debate, which was broadcast live on five TV stations.

Sample

Given the lack of experience with virtualized RTR field studies and recruitment strategies (exceptions being Boydstun et al., 2014; Wagschal et al., 2017), it is difficult to predict whether virtualized RTR can keep the promise to provide diverse and large N data.
To address this issue without having to fall back on commercial samples for financial reasons, we relied on extensive media cooperation for recruitment.1 This way we were able to attract around 45,000 visitors to our website, with around 28,000 completing the pre-debate survey and entering the RTR module. Finally, nearly 15,000 users completed the whole process by remaining active in the app and filling out the post-debate survey. Since recruitment was based on an “open access” strategy to increase participation, some users may have arrived late, left early, or only provided partial data. In the following, we therefore focus only on participants whose data structure comes closest to that of a regular laboratory setting. Specifically, we selected participants who had logged in and completed the pre-survey before the debate began (the login was opened 1 hr before the debate), filled out both pre- and post-surveys, did not take excessive time for completion (i.e., filled out the survey by 1:00 a.m.), resided in Germany (according to their IP address, which after the debate was replaced by an anonymous individual identifier to ensure data protection2), whose TV set had a stable playout delay (i.e., no web streams), and whose rating behavior suggested sincere, human participation.3

Since some respondents may have participated together and thus (e.g., by talking) influenced each other’s votes, some units may be interdependent (which also shifts the sampling unit to the household level); both issues could bias estimators. To address this possibility, we also excluded individuals with a common IP identifier.4 While this does not completely rule out that some dependent cases might remain (e.g., from individuals using their cell phone’s data plans, resulting in different IP addresses), we cannot further identify them based on their IP address or their survey answers. However, we believe their number to be fairly small, since it is very common for household members to share one WiFi connection, implying a shared address; using a data plan instead would require a user to actively refrain from using such a connection and rely on a potentially slower and costlier one. Since the question of dependence among units also touches upon potentially relevant household effects, it certainly warrants further attention, and we will return to it in our conclusion at the end of the article.

Our selection left us with a total of 5,660 participants5 whose demographic structure broadly follows known patterns found in online samples and reflects the general audience of our media partners by over-representing younger, male participants with a high level of education and a broad interest in politics (a detailed breakdown is available in Supplementary Table S1). At the same time, even without specific efforts to reach distinct population groups, all relevant demographic groups are represented in the sample. Concerning regional coverage, our results are encouraging, as the share of participants from different states closely follows the real distribution pattern (the only exception being Baden-Württemberg, whose over-representation stems from the project location). Also, the distribution of voting preferences comes fairly close to the distribution found in standard telephone surveys.
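To make the selection step concrete, the following is a minimal Python/pandas sketch of such a filtering procedure. The file name, column names, and timestamps are hypothetical illustrations and not the study’s actual data structure or code.

```python
import pandas as pd

# Load a combined survey + participation log (hypothetical file and column names).
df = pd.read_csv("debatometer_participants.csv", parse_dates=["login_time", "post_survey_end"])

DEBATE_START = pd.Timestamp("2017-09-03 20:15")   # approximate debate start, for illustration
CUTOFF = pd.Timestamp("2017-09-04 01:00")          # latest accepted post-survey completion

mask = (
    (df["login_time"] < DEBATE_START)              # logged in / pre-survey done before the debate
    & df["pre_survey_complete"]
    & df["post_survey_complete"]
    & (df["post_survey_end"] <= CUTOFF)            # no excessive completion time
    & (df["geo_country"] == "DE")                  # resided in Germany (IP-based)
    & (df["playout"] != "webstream")               # stable playout delay, i.e., no web streams
    & df["plausible_rating_behavior"]              # sincere, human rating behavior
)
sample = df[mask]

# Exclude all individuals who share an IP identifier to reduce household-level dependence.
sample = sample.drop_duplicates(subset="ip_id", keep=False)

print(len(sample))
```

The last step mirrors the exclusion of individuals with a common IP identifier described above: `keep=False` drops every member of a shared-IP group rather than retaining one of them.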
Device and Data

To investigate users’ evaluations of the TV debate, we developed a virtual RTR platform, the “Debat-O-Meter,” within an interdisciplinary research project of computer scientists and political scientists at the University of Freiburg. The Debat-O-Meter aims to be a full substitute for and expansion of the “classical” (i.e., physical) RTR approach, as it runs on mobile clients such as smartphones, tablets, or laptops. It incorporates an anonymous registration, a tutorial introducing the user interface, the RTR module as the app’s core function, and a module to survey viewers immediately before and after the debate. Data collected through the survey or the RTR module were immediately transferred to a server and stored together with the users’ pseudonyms and a time stamp. To incentivize users to participate, a final module presented an individualized analysis of their evaluative behavior during the debate, featuring characteristics of a voting advice application.

To ensure a high degree of standardization in data collection, we implemented several safeguards: First, the consecutive modules of the Debat-O-Meter reproduce the phasic structure of a classical RTR study in our “virtual laboratory.” Second, the instructions given to participants are regarded as a crucial point in RTR studies (Biocca et al., 1994; Maier et al., 2007). Thus, users were required to complete a detailed tutorial with precise instructions regarding the media stimulus, the input mode, and the number and specification of evaluable items and dimensions, following existing work in the field (Reinemann et al., 2005). The Debat-O-Meter was implemented as a push-button device in reset mode with a 5-point scale ranging from double plus (very good) to double minus (very bad). If no button was pressed, a neutral evaluation was inferred. For statistical analysis, the data are recoded to a scale ranging from −2 to +2.

Methods

To assess reliability, we draw on McDonald’s omega and Cronbach’s alpha as two indicators of internal consistency. We then investigate internal validity in three steps: First, we analyze the net number of votes cast for each politician by all participants. Second, we assess the relation of the RTR signal to respondents’ partisanship with the help of Kruskal–Wallis rank sum tests with post hoc multiple comparisons to examine whether all partisan groups can indeed be distinguished from each other. Third, we rely on structural equation modeling to analyze how the RTR signal connects to other attitudes before, during, and after the debate, thereby assessing the criterion validity of our measures.

Results

Reliability

When assessing the reliability of RTR measurement, the main focus in the literature rests on how internally consistent the evaluation is, that is, how well participants’ evaluations intercorrelate across the debate. One such approach has been laid out by Papastefanou (2013): its general idea is to interpret the stream of incoming votes as a test–retest scenario in which the RTR signal is taken as a longitudinal measurement that repeatedly asks the same item (i.e., how a politician is currently evaluated). If the reactions of the participants toward the politicians are captured reliably, their evaluations should not vary strongly, at least not in the short run.
Thus, substantial correlations across the different quasi-items are taken as a sign that the measurement procedure can be deemed internally consistent. To assess reliability, we divided the debate into slices representing the respective candidates’ speaking phases to create the necessary quasi-items (each phase ended 5 s after the person had ceased to speak to allow for slower votes to accumulate) and then separately summed up the positive and negative votes a participant had cast within a given interval for the respective politician. From these four participant-by-item tables, we calculated both McDonald’s omega and Cronbach’s alpha. All in all, we ended up with 45 speaking phases for Angela Merkel and 49 for Martin Schulz.6

Following Bachl (2014, p. 109), we depart somewhat from the approach taken by Papastefanou (2013), who divided the debates into regular slices of 30 s each. We did so because we know from prior studies that participants almost exclusively cast votes when a candidate is speaking, testifying to the high amount of attention devoted to verbal information (Nagel et al., 2012). Since we count no explicit reaction as a neutral evaluation, simply relying on regular slices as items would, for a given candidate, introduce arbitrary phases in which the whole audience reacts similarly to the candidate (because he or she is not speaking), thereby artificially inflating the intercorrelation of items across participants. Furthermore, we rely on two indicators of internal consistency, since alpha has recently come under criticism in the literature due to the (usually untenable) underlying assumption that all items load equally onto the measured construct (Dunn, Baguley, & Brunsden, 2014). Omega imposes no such restriction, allowing item loadings to vary (Kelley & Pornprasertmanit, 2016). As such, focusing on omega alone would already have been enough to assess reliability. Yet, we opted for calculating alpha as well (including normal theory bootstrap confidence intervals as described in Padilla, Divers, & Newton, 2012), since an interesting implication of the different measurement models underlying the two indicators is that omega should score higher than alpha if factor loadings indeed vary across time. Since our items refer to a single measurement instruction, loadings that vary over time would signal that participants were at times more or less aware of it. By implication, any systematic difference between alpha and omega carries information on how well participants heeded the measurement instructions during the debate.

As shown in Table 1, McDonald’s omega and Cronbach’s alpha generally suggest a high reliability of the RTR signal for all participants. Usually, omega remains above 0.95 and, in all cases, it is above 0.90. A similar pattern can be observed for alpha, whose lowest value is 0.893, still well within the range generally accepted as highly reliable. Furthermore, the differences between the two indicators are substantively negligible, suggesting that factor loadings are indeed constant, meaning that our participants generally adhered to the measurement instructions as the debate progressed.

Table 1. McDonald’s Omega and Cronbach’s Alpha Overall and by Ostensibly Partisan Evaluation Behavior (Including 95% Confidence Intervals)
                            Merkel positive         Merkel negative         Schulz positive         Schulz negative
McDonald’s omega (95% CI)
  All participants          0.974 (0.973, 0.975)    0.952 (0.950, 0.954)    0.962 (0.960, 0.963)    0.948 (0.946, 0.950)
  Partisan                  0.980 (0.979, 0.981)    0.962 (0.959, 0.964)    0.970 (0.968, 0.972)    0.955 (0.952, 0.958)
  Non-partisan              0.944 (0.941, 0.946)    0.912 (0.908, 0.916)    0.934 (0.930, 0.937)    0.904 (0.900, 0.909)
Cronbach’s alpha (95% CI)
  All participants          0.963 (0.957, 0.969)    0.943 (0.935, 0.952)    0.946 (0.941, 0.951)    0.936 (0.927, 0.945)
  Partisan                  0.969 (0.964, 0.975)    0.952 (0.944, 0.960)    0.954 (0.949, 0.960)    0.942 (0.934, 0.951)
  Non-partisan              0.933 (0.928, 0.937)    0.904 (0.896, 0.913)    0.920 (0.915, 0.925)    0.893 (0.883, 0.903)

Minor differences appear when we single out participants who did not cast at least one positive and one negative vote for each politician (suggesting a biased evaluation behavior) and whose number of votes during the debate exceeded +2 SD of the distribution (suggesting comparatively intensive evaluative behavior). These individuals score slightly higher in terms of both omega and alpha than the remaining participants. We found a total of 1,915 such individuals. Split up along these lines, omega and alpha distribute as shown in the respective lower rows of Table 1.
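To illustrate how such participant-by-item tables translate into the two coefficients, here is a minimal Python sketch. The vote-level file, its column names, and the use of the factor_analyzer package for a one-factor omega model are assumptions for illustration only; the article does not report which software was used, and the bootstrap confidence intervals of Padilla et al. (2012) are omitted for brevity.

```python
import pandas as pd
from factor_analyzer import FactorAnalyzer  # assumed available; provides the one-factor model

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Cronbach's alpha for a participant-by-item matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

def mcdonald_omega(items: pd.DataFrame) -> float:
    """Omega total from a one-factor model:
    (sum of loadings)^2 / ((sum of loadings)^2 + sum of uniquenesses)."""
    fa = FactorAnalyzer(n_factors=1, rotation=None)
    fa.fit(items)
    loadings = fa.loadings_.ravel()
    return loadings.sum() ** 2 / (loadings.sum() ** 2 + fa.get_uniquenesses().sum())

# votes: one row per button press, with hypothetical columns
#   user_id, candidate ("merkel"/"schulz"), valence ("pos"/"neg"), phase (speaking-phase index)
votes = pd.read_csv("rtr_votes.csv")

# Number of positive votes per participant and Merkel speaking phase (missing = 0, i.e., neutral).
# In a full analysis, participants without any positive Merkel vote would be re-added as all-zero rows.
pos_merkel = (
    votes.query("candidate == 'merkel' and valence == 'pos'")
    .groupby(["user_id", "phase"]).size()
    .unstack(fill_value=0)
)

print("alpha:", cronbach_alpha(pos_merkel))
print("omega:", mcdonald_omega(pos_merkel))
```

The same two calls would be repeated for the remaining three participant-by-item tables (Merkel negative, Schulz positive, Schulz negative).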
As Table 1 shows, reliability is slightly weaker for less partisan individuals. Yet, as both indicators still remain above the threshold, the drop appears too small to warrant concern. In summary, H1 appears confirmed and it seems safe to answer our first research question in the affirmative: measuring debate perception with an online RTR system can yield a reliable signal.

Internal Validity

Laboratory-based RTR studies possess high internal validity, meaning that it is comparatively easy to connect stimulus and (RTR) response. Their main drawback, however, is the difficulty of generalizing findings beyond the concrete group of participants. For mobile RTR devices, the opposite constellation seems more likely: While the large and diverse number of participants should make it easier to generalize to larger audiences, it is less certain that stimulus and response can be connected with the same confidence as in laboratory settings. In other words, for mobile RTR studies, internal validity is potentially an issue.

To assess internal validity, we follow two ideas laid out with respect to laboratory-based studies (Maier et al., 2007; Reinemann et al., 2005). Essentially, both investigate to what extent the RTR signal is correlated with other variables involved in the process of debate reception and candidate evaluation. The first test usually investigates the extent to which the RTR signal is correlated with participants’ partisan identification, which is known to reside at the heart of many political evaluation processes and which is taken as the reference concept in the assessment of construct validity (see Maier, 2007). The better the RTR signal is predicted by partisanship, the more it behaves in theoretically plausible ways and the more valid the signal appears to be. The second approach often relies on structural equation modeling to assess criterion validity in particular. It draws on the relation between the RTR signal, candidate evaluation before and after the debate, and expected/perceived candidate debate performance (see Bachl, 2013).

Concerning the relation of the RTR signal to respondents’ partisanship, we first calculated the net number of votes cast for each politician for all participants, excluding those who had skipped the question on partisanship or who had mentioned “other” parties. Given the friend-or-foe logic of the social identity that resides at the heart of partisanship (Green et al., 2004), it is straightforward to derive expectations on how the live evaluation should connect to respondents’ partisanship: For adherents of the two major parties (CDU/CSU and SPD), we can expect a direct tendency to evaluate one’s “own” candidate positively while rejecting the respective opponent. For the two classical coalition partners, a similar logic should lead to positive votes for the traditional “ally” and negative evaluations for the opposing bloc. Thus, adherents of the Free Democratic Party (FDP) should generally view Angela Merkel more positively than Schulz while being less enthusiastic than adherents of the CDU/CSU. A similar pattern should apply to Green partisans with respect to Martin Schulz and adherents of the SPD. Again, relations should be reversed for the respective other side’s candidate. This leaves us with adherents of the left-wing party DIE LINKE (Linke), of the right-wing party Alternative for Germany (AfD), and unattached viewers to account for.
While adherents of the Linke may partly see Angela Merkel in a positive light due to her decisions during the refugee crisis in 2015, the more general pattern should be a favorable evaluation of Schulz, given the common political camp and his perceived openness to a coalition including the Linke. Concerning the AfD, whose formation was largely driven by opposition to Merkel, we expect strong hostility toward her plus a general dislike for Schulz, partly because the candidate is from the left camp and partly because part of the AfD’s self-legitimization rests on a critique of the established parties. Unattached participants, in turn, should generally fall outside the friend-or-foe logic of the debate and show a more balanced evaluation.

Table 2 shows the average net evaluations for participants by partisan attachment (the associated eta-squared values are 0.108 for Merkel and 0.130 for Schulz). For better comparison, we have also calculated the net average value across all participants. As can be seen directly, the values correspond to our theoretical expectations: adherents of the CDU/CSU and the SPD show a decidedly positive attitude toward their “own” candidate (50.6 for Merkel and 64.4 for Schulz, respectively) and a clearly negative attitude toward their “enemy” (−10.3 for Schulz vs. −11.0 for Merkel). The same is true for adherents of likely coalition partners—here, the Greens show a weak negative (i.e., below-average) tendency toward Merkel while reacting clearly positively toward Schulz; the same is visible for FDP adherents, albeit reversed. For the Linke, we can observe a clear preference for Schulz and an (even stronger than expected) negative stance toward Merkel. The most negative reaction can be found among AfD adherents who, as expected, show a strongly negative evaluation of Merkel while being largely neutral toward Schulz. Respondents without a partisan attachment lie fairly close to the distribution for the whole sample, indicating that their evaluation is indeed more open. While ideally we would like to compare these relations to laboratory-based data, such information is unavailable for the 2017 debate at the time of writing. However, similar patterns of support and rejection within and across camps can also be found in parts of the literature, albeit only reported for vote intentions (e.g., Bachl, 2013, p. 144) or aggregated partisan groups (Jansen & Glogger, 2017, pp. 49, 51). They also largely replicate with respect to candidate evaluations both within our own and in external survey data (scaled to a similar metric and demeaned for easier comparison, see Table 2), further bolstering our confidence in the measurement. Since the evaluation behavior of the respective partisan groups corresponds to our theoretical assumptions, we can confirm H2.

Table 2. Average Net Evaluations for Participants by Partisan Attachment

                                                       Mean     SPD     Greens    CDU     FDP     AfD      Linke    No PID
Mean RTR score
  Merkel                                               17.5    −11.0     13.4     50.6    29.2    −37.9     −2.9     10.2
  Schulz                                               23.0     64.4     37.8    −10.3     9.3     16.4     47.7     26.9
Mean evaluation before debate, demeaned
  Merkel                                              (0.46)   −0.67     0.02     0.85    0.31    −1.42    −0.71    −0.28
  Schulz                                             (−0.09)    1.00     0.37    −0.66   −0.48    −0.37     0.41     0.00
Mean evaluation, Politbarometer (August 29–31, 2017), demeaned
  Merkel                                              (0.89)   −0.22    −0.01     0.60    0.08    −1.48     0.08    −0.25
  Schulz                                              (0.23)    0.78     0.31    −0.32   −0.63    −0.89     0.13    −0.07

Note. Mean RTR scores were calculated by adding positive (“++” = +2; “+” = +1) and negative (“−” = −1; “−−” = −2) votes for each participant across the debate. The resulting individual net scores were then averaged across partisan groups. Mean evaluation before the debate (possible range: +2 to −2) was calculated separately for each partisan group and then demeaned by subtracting the average evaluation across all participants. Mean evaluation from the Politbarometer was calculated accordingly after rescaling the original item (+5 to −5) to a scale from +2 to −2. CDU = Christian Democratic Union; PID = party identification; RTR = real-time response; SPD = Social Democratic Party.
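The computation described in the note is straightforward to reproduce. The sketch below assumes hypothetical files, column names, and candidate labels (“merkel”, “schulz”); it is not the authors’ actual code.

```python
import pandas as pd

# Vote-level data with the 5-point code already recoded to -2..+2
# ("++" = +2, "+" = +1, "-" = -1, "--" = -2); no button press contributes nothing.
votes = pd.read_csv("rtr_votes.csv")        # columns: user_id, candidate, value
survey = pd.read_csv("pre_survey.csv")      # columns: user_id, party_id

# Net score per participant and candidate: sum of all signed votes across the debate.
net = (
    votes.groupby(["user_id", "candidate"])["value"].sum()
    .unstack(fill_value=0)                  # one column per candidate
    .join(survey.set_index("user_id")["party_id"], how="inner")
)

# Average net evaluation by partisan group (cf. Table 2),
# dropping respondents with missing or "other" party identification.
table2 = (
    net.dropna(subset=["party_id"])
    .query("party_id != 'other'")
    .groupby("party_id")[["merkel", "schulz"]].mean()
)
print(table2.round(1))
```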
To further explore the validity of this relation, we also perform Kruskal–Wallis rank sum tests with post hoc multiple comparisons to determine whether all partisan groups can indeed be distinguished from each other. Both global tests were significant at p < 0.0001, and nearly all groups were distinguishable in the post hoc comparisons at a significance level of 0.05. The only exception for Merkel concerned participants without a partisanship and adherents of the Greens, which is not too surprising given that the party appears open to coalitions with both major parties.
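Since the article does not specify the exact post hoc procedure, the following sketch uses one common option, pairwise Mann–Whitney U tests with Holm correction after a global Kruskal–Wallis test; the input file and column names are again hypothetical.

```python
from itertools import combinations

import pandas as pd
from scipy.stats import kruskal, mannwhitneyu
from statsmodels.stats.multitest import multipletests

# Per-participant net RTR scores with party identification (hypothetical file/columns).
df = pd.read_csv("net_scores_with_pid.csv")   # columns: party_id, merkel, schulz

groups = {pid: sub["merkel"].to_numpy() for pid, sub in df.groupby("party_id")}

# Global Kruskal-Wallis rank sum test across all partisan groups.
H, p_global = kruskal(*groups.values())
print(f"Kruskal-Wallis H = {H:.1f}, p = {p_global:.2g}")

# Pairwise post hoc comparisons with Holm-adjusted p-values.
pairs = list(combinations(groups, 2))
p_raw = [mannwhitneyu(groups[a], groups[b], alternative="two-sided").pvalue for a, b in pairs]
reject, p_adj, _, _ = multipletests(p_raw, alpha=0.05, method="holm")

for (a, b), p, r in zip(pairs, p_adj, reject):
    print(f"{a} vs {b}: adjusted p = {p:.3f} -> {'distinct' if r else 'not distinct'}")
```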
For Schulz, the tests failed to distinguish adherents of the Greens from those of the Linke, the Linke from the SPD, and the AfD from individuals without a partisanship. The two former findings are comparatively narrow misses and appear reasonable given the common camp and Schulz’s openness to coalitions with the Linke; the latter may well stem from the dominant motive of dislike for Merkel among AfD adherents. Given that the respective partisan groups can otherwise be statistically discriminated, we are able to confirm H3. As partisanship quite robustly predicts real-time evaluation, we answer RQ2.1 in the affirmative as well.

A final piece of evidence can be generated by examining how the RTR signal connects to other immediate attitudes about the debate. In the literature, a common approach to this question relies on structural equation models that connect attitudes before, during, and after the debate. Since we assume no latent variables, a path model is sufficient, and our main focus rests on whether known structures from classical laboratory-based studies can be reproduced with our data. As a benchmark for comparison, we draw on the work by Bachl (2013). In his model, perceived performance during the debate (as captured by the RTR signal) is affected both by prior candidate evaluation and by expected debate performance. In turn, all three variables govern how the candidate’s performance is assessed after the debate, which finally affects overall candidate evaluation. In addition, candidate evaluation after the debate rests on live debate performance and prior candidate evaluations. We have reproduced the path structure from Bachl’s model with our data and tried to follow his operationalization as closely as possible to allow for comparison (see p. 177 for his operationalization).7

Again, both models confirm the general picture of valid upstream and downstream relations for the RTR signal. It should be noted that the model for Schulz does not pass the likelihood-ratio test, which seems of lesser relevance, though, as this test has been shown to regularly reject models estimated on a large number of cases (Weiber & Mühlhaus, 2014, p. 204). The root mean square error of approximation (RMSEA) and the standardized root mean square residual (SRMR), which take model structure into account, as well as all three available comparative fit indices (comparative fit index [CFI], Tucker–Lewis index [TLI], and normed fit index [NFI]) indicate a good model fit (for Merkel: RMSEA = 0.029 [95% CI: 0.009, 0.054], pClose = 0.908, SRMR = 0.018, CFI = 1.000, TLI = 0.997, NFI = 1.000; for Schulz: RMSEA = 0.015 [95% CI: 0.000, 0.043], pClose = 0.985, SRMR = 0.025, CFI = 1.000, TLI = 0.999, NFI = 1.000). The models presented in Figure 1 show both standardized coefficients and the associated R2 values.

Figure 1. Structural equation model—Angela Merkel and Martin Schulz.

Both models indicate that candidate evaluation before the debate has a substantial influence on who is expected to win: the higher a candidate is esteemed, the more likely a participant expects that candidate to decide the debate for him- or herself. The effect of expected performance on perceived performance after the debate is still noticeable but much less pronounced. The same holds true for prior evaluation.
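As an illustration of the path structure just described: because the model contains only observed variables, it can be approximated equation by equation, for example with OLS and robust standard errors as sketched below. The column names and the winner-coding helper follow footnote 7 but are hypothetical; standardizing the variables beforehand would yield standardized coefficients, and dedicated SEM software (not this sketch) would be needed to reproduce the reported global fit indices.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical per-participant data for the Merkel model.
# eval_pre / eval_post range from -2 to +2; rtr is the mean real-time evaluation (-2..+2).
df = pd.read_csv("merkel_path_data.csv")

# Footnote-7-style coding of the expected/perceived debate winner.
winner_code = {"merkel": 1, "draw": 0, "schulz": -1}
df["expected_perf"] = df["expected_winner"].map(winner_code)
df["perceived_perf"] = df["perceived_winner"].map(winner_code)

# Recursive path structure, estimated equation by equation with HC3 robust standard errors.
m1 = smf.ols("expected_perf ~ eval_pre", data=df).fit(cov_type="HC3")
m2 = smf.ols("rtr ~ eval_pre + expected_perf", data=df).fit(cov_type="HC3")
m3 = smf.ols("perceived_perf ~ rtr + eval_pre + expected_perf", data=df).fit(cov_type="HC3")
m4 = smf.ols("eval_post ~ perceived_perf + rtr + eval_pre", data=df).fit(cov_type="HC3")

for name, m in [("expected", m1), ("rtr", m2), ("perceived", m3), ("eval_post", m4)]:
    print(name, m.params.round(2).to_dict(), "R2 =", round(m.rsquared, 3))
```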
In both models, immediate debate perception strongly affects how the candidate was seen to perform in retrospect. As such, political leanings may affect expectations, but they are not able to undo actual perception. Considering that the correlations of real-time evaluations and post-debate verdicts on the candidates’ performances are significant and substantial even after controlling for prior attitudes and other measures of debate perception, we conclude that criterion validity is given for our data and thus confirm H4 as well. Interestingly, however, the effect of prior evaluation on real-time perception is substantial, indicating that respondents are strongly inclined to view candidate debate performance through a colored lens. Proceeding to post-debate candidate evaluation, there is a noticeable direct effect of post-debate performance (in turn shaped by the RTR signal), accompanied by a substantial direct effect of how the candidate was evaluated during the discussion. By and large, the relative sizes of the coefficients are comparable to similar models estimated in laboratory-based settings (see Bachl, 2013, p. 184), suggesting that our measurement yields comparable results. Furthermore, the fact that, in the case of Schulz, actual debate performance (the RTR signal) exerts a comparatively larger influence on downstream variables while prior evaluations have a lesser effect on post-debate performance nicely fits the idea that his image as the challenger was not as solidified as that of incumbent Merkel. We can therefore also answer RQ2.2 in the affirmative.

Conclusion

In this article, we have tried to establish the reliability and validity of virtualized RTR measurements outside the laboratory. While existing work has already shown that, in laboratory settings, physical and virtual devices perform similarly (Metz et al., 2016) and that laboratory and home settings generally correlate as expected (Maier, Hampe, et al., 2016), the question remained whether research can also rely on data obtained from home-based measurements on a larger scale. Given the setting and the impossibility of interacting with the participants, this question is crucial when real-time response measurement is extended to bigger, more diverse audiences across larger geographical areas. To address this issue, our strategy was to assess the reliability and validity of a virtualized measurement in the light of known evidence from laboratory-based settings. To do so, we have drawn on RTR data gathered in the context of the major TV debate between the two chancellor candidates in the 2017 German general election, Angela Merkel and Martin Schulz. This dataset is the largest collection of RTR data available today. By focusing on participants who fully followed the study protocol, we have singled out those who come closest to laboratory participants to obtain the best grounds for comparison.

We are quite confident that mobile RTR can provide diverse and large N data and that recruiting participants through media partnerships is a viable and cost-effective option. For some participant attributes, such as regional coverage, gender, and vote choice, the sample we obtained was already quite satisfactory. Yet we also noticed that the sample tendency known from laboratory settings, that is, a concentration of individuals who are more interested in politics, more educated, and in part younger, was also present in our data.
While the sheer number of individuals in principle allows us to look into details such as “rare” groups and offers the possibility of post-stratification, it still seems important to motivate the remaining groups to join as well to further improve data quality. In the end, however, these problems appear manageable, especially when the financial means to resort to established methods of recruitment, such as telephone surveys or ready-made panels, are in place. Yet even where such resources are available, the fact that all relevant population groups were already present in our data suggests that established recruitment practices might not be necessary, given a suitable re-weighting scheme.

In terms of reliability, we can state that our data are highly consistent internally and that this consistency falls within the ranges known from laboratory studies. Accordingly, switching to a virtualized environment does not seem to cause substantial problems in this respect. The same conclusion can be drawn when assessing construct and criterion validity: Across all groups, partisanship is associated with RTR evaluations as expected, meaning that it is generally possible to step outside the laboratory without a major risk of compromising internal validity. This connection is so stable that nearly all of the partisan groups could readily be distinguished in post hoc tests. Turning to the up- and downstream relations of live debate perception, we also found that established models from laboratory settings held true for our participants. We therefore conclude that mobile RTR data possess an internal validity comparable to laboratory-based studies.

Another important caveat is that our sample is restricted to those who fully followed the study protocol; we therefore cannot say much about those who, for instance, joined late, switched to other channels, lost interest, or failed to take the situation seriously. Yet, given that our results come so close to laboratory-based evidence, we have good reasons to be confident that most of the findings also extend to the rest of our participants.

Another issue to consider is that of dependence among units. Within a laboratory setting, it is fairly straightforward not only to select participants individually but also to tell them not to interact, thus providing a standardized environment. At home, the situation is different. Researchers have no direct control over whether respondents watch together and whether they exchange opinions, discuss, or even argue. This potentially introduces patterns of dependence among units that (as a limitation of our current work) we can only partly account for when measuring over the Internet. There seem to be two ways to react to this challenge: One would be to increase standardization by, for example, explicitly asking participants not to talk during the debate, or maybe even to watch alone. Done right, this can potentially yield results coming close to those of laboratory situations and help to extend a known and understood setting into the population. Alternatively, one may try to monitor the context in which the debate was watched and use it as an ex post control or even as an independent variable: Did the respondent watch together with others? Did the respondent talk to these persons? Were they of the same opinion?
This would allow researchers to actively model the effects of interdependence and interaction on the perception of televised debates, something that is currently hardly understood.

In this article, we have asked whether it is safe to rely on RTR data gathered in virtualized settings outside the laboratory. In summary, we find that the answer to our question is a quite consistent “yes.” RTR data collected with the help of virtualized devices over the Internet appear reliable and valid and behave much the same way as data gathered in a laboratory. Arising differences can readily be explained as implications of our existing models and thus do not pose problems of data reliability or validity. Therefore, it seems possible to extend RTR research to large audiences at home without compromising the study objectives. As such, smartphones and other devices with an Internet connection open up a new way to explore how large-scale audiences perceive political debates in the confines of their private homes. While laboratory-based studies still have their advantages and justifications, we have shown that they can well be supplemented with alternative ways of measurement, allowing research to address new and innovative questions. In a sense, the door is open.

Supplementary Data

Supplementary Data are available at IJPOR online.

Biographical Note

Thomas Waldvogel is a research associate at the Department of Political Science, Albert-Ludwigs University of Freiburg, Germany. His research focuses on political communication, election campaigns, and political education.

Thomas Metz is a research associate at the Department of Political Science, Albert-Ludwigs University of Freiburg, Germany. His research focuses on election studies, peace and conflict studies, and statistical methods in political science.

Footnotes

1 We recruited over 90 regional and national newspapers and online media, including Süddeutsche Zeitung (circulation 350,000), Focus Online (23.7 million users), Redaktionsnetzwerk Deutschland (providing content for about 40 newspapers), and Sat.1 national TV. In exchange for coverage of the project and calls for participation, media partners were offered interviews and articles to support coverage plus a preliminary analysis directly after the debate. Participants received a customized analysis of their evaluative behavior plus a shortened version of the media partners’ analysis.

2 The Debat-O-Meter uses the hypertext transfer protocol secure (HTTPS) for privacy protection. IPs were automatically converted to consecutive numbers, and the original addresses were deleted after 24 hr. We also recorded participants’ user agent header for information on the users’ devices (PC or mobile).

3 To prevent abuse, the Debat-O-Meter uses measures such as Captchas to distinguish human users from automated scripts and monitors user behavior in real time to spot suspicious activity. All in all, suspicious users made up only a small part of the sample.

4 This choice eliminated 478 individuals sharing 228 IP addresses. In most of these cases (210 addresses), the address was shared by two people of comparable age and opposite sex, suggesting couples.

5 The lion’s share of the reduction comes from participants outside Germany and users who joined later during the debate.

6 One speaking phase for Merkel had to be discarded because it lasted only one second (a brief “yes” to a minor question) and elicited no reaction from the audience.
7 Candidate evaluation is an overall assessment between −2 (very low) and +2 (very high), and debate performance is measured as the mean RTR evaluation (ranging from −2 to +2). While Bachl used separate items, we measured pre-/post-debate performance by recoding the expected/perceived debate winner: Respondents who expected/perceived Merkel to win were coded +1, those expecting/perceiving a draw were given a value of 0, and an expected/perceived victory for Schulz was coded as −1. We used the same scheme for Schulz. Given our coding, the assumption of normality could not be sustained, so we estimated all models with robust standard errors.

Acknowledgment

This article is our own work, but it would not have been possible without support and feedback from the rest of the Debat-O-Meter team, which we greatly appreciate.

References

Bachl, M. (2013). Die Wahrnehmung des TV-Duells. In M. Bachl, F. Brettschneider, & S. Ottler (Eds.), Das TV-Duell in Baden-Württemberg 2011 (pp. 135–169). Wiesbaden: Springer VS.

Bachl, M. (2014). Analyse rezeptionsbegleitend gemessener Kandidatenbewertungen in TV-Duellen (PhD thesis, Universität Hohenheim).

Benoit, W. L., Hansen, G. J., & Verser, R. M. (2003). A meta-analysis of the effects of viewing U.S. presidential debates. Communication Monographs, 70(4), 335–350.

Biocca, F., David, P., & West, M. (1994). Continuous response measurement (CRM). In A. Lang (Ed.), Measuring psychological responses to media messages (pp. 15–64). Hillsdale, NJ: Erlbaum.

Bortz, J., & Döring, N. (2006). Forschungsmethoden und Evaluation für Human- und Sozialwissenschaftler. Heidelberg: Springer.

Boyd, T. C., & Hughes, G. D. (1992). Validating realtime response measures. In J. F. Sherry Jr. & B. Sternthal (Eds.), NA – Advances in Consumer Research (Vol. 19, pp. 649–656). Provo, UT: Association for Consumer Research.

Boydstun, A. E., Glazier, R. A., Pietryka, M. T., & Resnik, P. (2014). Real-time reactions to a 2012 presidential debate. Public Opinion Quarterly, 78, 330–343.

Campbell, A., Converse, P., Miller, W., & Stokes, D. (1960). The American voter. New York: Wiley.

Dunn, T., Baguley, T., & Brunsden, V. (2014). From alpha to omega: A practical solution to the pervasive problem of internal consistency estimation. British Journal of Psychology, 105(3), 399–412.

Fenwick, I., & Rice, M. D. (1991). Reliability of continuous measurement copy-testing methods. Journal of Advertising Research, 31(1), 23–29.

Green, D. P., Palmquist, B., & Schickler, E. (2004). Partisan hearts and minds: Political parties and the social identities of voters. New Haven, CT: Yale University Press.

Hallonquist, T., & Peatman, J. G. (1947). Diagnosing your radio program, or the program analyzer at work. In Institute for Education by Radio (Ed.), Education on the air: Yearbook of the Institute for Education by Radio (pp. 463–474). Columbus, OH: Ohio State University Press.

Hallonquist, T., & Suchmann, E. E. (1944). Listening to the listener: Experiences with the Lazarsfeld-Stanton program analyzer. In P. F. Lazarsfeld & F. Stanton (Eds.), Radio research 1942–1943 (pp. 265–334). New York: Arno Press.

Hughes, G. D., & Lennox, R. (1990). Realtime response research: Construct validation and reliability assessment. In W. Bearden & A. Parasuraman (Eds.), Enhancing knowledge development in marketing (pp. 284–288). Chicago, IL: American Marketing Association.

Jansen, C., & Glogger, I. (2017). Von Schachteln im Schaufenster, Kreisverkehren und (keiner) PKW-Maut. In T. Faas, J. Maier, & M. Maier (Eds.), Merkel gegen Steinbrück (pp. 31–58). Wiesbaden: Springer VS.

Kelley, K., & Pornprasertmanit, S. (2016). Confidence intervals for population reliability coefficients. Psychological Methods, 21(1), 69–92.

Maier, J. (2007). Erfolgreiche Überzeugungsarbeit: Urteile über den Debattensieger und die Veränderung der Kanzlerpräferenz. In M. Maurer, C. Reinemann, J. Maier, & M. Maier (Eds.), Schröder gegen Merkel (pp. 91–109). Wiesbaden: VS Verlag.

Maier, J., Faas, T., & Maier, M. (2014). Aufgeholt, aber nicht aufgeschlossen: Ausgewählte Befunde zur Wahrnehmung und Wirkung des TV-Duells 2013 zwischen Angela Merkel und Peer Steinbrück. Zeitschrift für Parlamentsfragen, 45(1), 38–54.

Maier, J., Hampe, J. F., & Jahn, N. (2016). Breaking out of the lab: Measuring real-time responses to televised political content in real-world settings. Public Opinion Quarterly, 80(2), 542–553.

Maier, J., Maurer, M., Reinemann, C., & Faas, T. (2007). Reliability and validity of real-time response measurement: A comparison of two studies of a televised debate in Germany. International Journal of Public Opinion Research, 19(1), 53–73.

Maier, J., Rittberger, B., & Faas, T. (2016). Debating Europe: Effects of the “Eurovision Debate” on EU attitudes of young German voters and the moderating role played by political involvement. Politics and Governance, 4(1), 55–68.

Maurer, M., Reinemann, C., Maier, J., & Maier, M. (Eds.). (2007). Schröder gegen Merkel. Wiesbaden: VS Verlag.

McKinney, M., & Carlin, D. (2004). Political campaign debates. In L. L. Kaid (Ed.), Handbook of political communication research (pp. 203–234). Mahwah, NJ: Lawrence Erlbaum.

Metz, T., Wagschal, U., Waldvogel, T., Bachl, M., Feiten, L., & Becker, B. (2016). Das Debat-O-Meter. Zeitschrift für Staats- und Europawissenschaften, 14(1), 124–149.

Nagel, F., Maurer, M., & Reinemann, C. (2012). Is there a visual dominance in political communication? Journal of Communication, 62(5), 833–850.

Padilla, M., Divers, J., & Newton, M. (2012). Coefficient alpha bootstrap confidence interval under nonnormality. Applied Psychological Measurement, 36(5), 331–348.

Papastefanou, G. (2013). Reliability and validity of RTR measurement device (Working Paper 2013-27). GESIS – Leibniz-Institut für Sozialwissenschaften.

Reinemann, C., Maier, J., Faas, T., & Maurer, M. (2005). Reliabilität und Validität von RTR-Messungen. Publizistik, 50(1), 56–73.

Schill, D., & Kirk, R. (2014). Courting the swing voter: “Real time” insights into the 2008 and 2012 U.S. presidential debates. American Behavioral Scientist, 58(4), 536–555.

Schill, D., Kirk, R., & Jasperson, A. E. (2017). Political communication in real time. New York: Routledge.

Schwerin, H. (1940). An exploratory study of the reliability of the “program analyzer.” Journal of Applied Psychology, 24(6), 742–745.

Wagschal, U., Waldvogel, T., Metz, T., Becker, B., Feiten, L., Weishaupt, S., & Singh, K. (2017). Das TV-Duell und die Landtagswahl in Schleswig-Holstein. Zeitschrift für Parlamentsfragen, 48(3), 594–613.

Weiber, R., & Mühlhaus, D. (2014). Strukturgleichungsmodellierung. Heidelberg: Springer.

© The Author(s) 2020. Published by Oxford University Press on behalf of The World Association for Public Opinion Research. All rights reserved. This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model).

Measuring Real-Time Response in Real-Life Settings

Loading next page...
 
/lp/oxford-university-press/measuring-real-time-response-in-real-life-settings-iy1Lf08Rfx

References (22)

Publisher
Oxford University Press
Copyright
© The Author(s) 2020. Published by Oxford University Press on behalf of The World Association for Public Opinion Research. All rights reserved.
ISSN
0954-2892
eISSN
1471-6909
DOI
10.1093/ijpor/edz050
Publisher site
See Article on Publisher Site

Abstract

Others have also relied on parallel-test reliability (Maier, Maurer, Reinemann, & Faas, 2007; Maier, Rittberger, & Faas, 2016; Metz et al., 2016; Reinemann, Maier, Faas, & Maurer, 2005). Comparing a push-button and a slider system, Maier et al. (2007) identify a significant correlation of 0.38 for the whole media stimulus, which rises to 0.69 for certain key sequences. In addition to these findings on physical devices, only a few studies examine the reliability of virtualized RTR. Metz et al. (2016) were the first to provide insights by comparing a virtualized slider implementation with physical dials, finding an aggregated correlation of 0.77 for randomized groups in a controlled laboratory design. Their findings are supported by Maier, Hampe, et al. (2016), who report coefficients of parallel-test reliability above 0.51 when comparing two nonrandomized groups watching the same debate, one in a laboratory and one at home.

Internal Validity

Regarding internal validity, the picture is both clearer and in general more conclusive: several studies confirm that RTR data correlate with related variables in expected ways (e.g., Bachl, 2013; Maier, Hampe, et al., 2016; Maier et al., 2007; Papastefanou, 2013; Reinemann et al., 2005). For instance, real-time evaluations have been shown to be substantially affected by party identification, an individual's stable and affective tie to a political party (e.g., Campbell, Converse, Miller, & Stokes, 1960) that is known to color the perception and evaluation of both political candidates and issues. Similarly, post-debate judgments of contenders' performances are strongly associated with viewers' spontaneous real-time impressions, a relation that remains in place even after controlling for party identification (Bachl, 2013; Biocca et al., 1994; Maier et al., 2007; Papastefanou, 2013; Reinemann et al., 2005). With respect to virtualized RTR measurement, Maier, Hampe, et al. (2016) verify construct validity by analyzing the relationship between real-time responses and party identification, and criterion validity by examining the correlations between RTR scores and the perceived debate winner. These correlations are confirmed by a path analysis that reproduces a well-known structure from previous research on televised debates in Germany (Bachl, 2013; Maier et al., 2007).

External Validity

Since most knowledge from RTR research derives from laboratory settings with small groups of participants using physical devices, little is known about the method's external validity. Uncertainty comes from the fact that research has so far mostly relied on physical devices in laboratory settings (e.g., Boydstun et al., 2014; Maier, Rittberger, et al., 2016; Papastefanou, 2013; Wagschal et al., 2017). These (artificial) situations might differ considerably from natural reception situations in private surroundings, thus compromising the external validity of RTR studies (Bachl, 2014; Papastefanou, 2013). Apart from the aim of obtaining a socially and spatially representative sample in the first place, several other problems concerning external validity (Bachl, 2014; Metz et al., 2016; Papastefanou, 2013) are not the focus of this article and are left to further research. Rather, we concentrate on the reliability and validity of virtualized RTR data, which are still largely unexplored. Filling this research gap is the main goal of this article.
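To make the logic of parallel-test reliability concrete, the following sketch correlates the aggregate RTR curves of two independently measured groups, for example a physical-dial group and a virtual-slider group. All file names, column names, and the data layout are assumptions made for illustration; this is not the setup of the studies cited above.

```python
# Minimal sketch of parallel-test reliability: correlate the aggregate
# per-second RTR curves of two measurement groups. Data are hypothetical.
import pandas as pd

def aggregate_curve(votes: pd.DataFrame) -> pd.Series:
    """Mean evaluation (-2 to +2) per debate second across all participants."""
    return votes.groupby("second")["rating"].mean()

# Long-format vote logs with columns [participant, second, rating].
votes_dial = pd.read_csv("group_dial_votes.csv")      # hypothetical file
votes_slider = pd.read_csv("group_slider_votes.csv")  # hypothetical file

curves = pd.concat(
    [aggregate_curve(votes_dial), aggregate_curve(votes_slider)],
    axis=1, keys=["dial", "slider"],
).dropna()
print("Parallel-test correlation:", round(curves["dial"].corr(curves["slider"]), 2))
```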
Research Questions and Hypotheses

While leaving the laboratory holds the promise of accessing large and diverse audiences, it also entails a significant risk since researchers have no control over the environment in which participants watch the debate and, consequently, no means to ensure sufficient data quality. Essentially, this means that the question of the reliability (and internal validity) of the incoming data has to be addressed. Hence, we raise our first research question (RQ1): Is RTR field data reliable? Regarding the reliability of survey data, McDonald's omega and Cronbach's alpha are considered established indicators in the literature. While previous work suggests that scores above 0.9 indicate a high degree of reliability, α-values surpassing 0.8 are also deemed acceptable (Bortz & Döring, 2006, p. 199). Assuming that our field data are reliable, we hypothesize that scores in our study remain above the crucial threshold of 0.8 (H1). Moreover, another issue to assess is internal validity. Here, construct and criterion validity are two established ways to investigate whether data are internally valid (Bachl, 2013; Maier et al., 2007; Maier, Rittberger, et al., 2016). Since this is the first time that mobile RTR data from a large field study are tested from a methodological perspective, we ask (RQ2): Is RTR field data internally valid? In previous studies, construct validity has been operationalized by examining the association between RTR scores and party identification, leading us to ask whether we can also find such an association in field data (RQ2.1). Based on the underlying friend-or-foe logic of the social identity that resides at the heart of partisanship (Green, Palmquist, & Schickler, 2004), we assume pronounced differences in the average net evaluations of participants by partisan attachment (H2) and that all partisan groups are indeed significantly distinguishable from one another in terms of their evaluations (H3). Furthermore, validity can be investigated by analyzing the association between real-time responses and external criteria, for example, the perceived debate winner, suggesting that the data should be consistent with other measures of debate perception (RQ2.2). Hence, we hypothesize that real-time evaluations and post-debate verdicts on the candidates' performances correlate both significantly and substantially even after controlling for prior attitudes and other measures of debate perception (H4), thereby fulfilling the notion of criterion validity.

Data and Methods

Stimulus

The only televised debate between the two main contenders for the chancellorship in the 2017 German general election took place on the evening of September 3, 2017. Three weeks before election day, incumbent Angela Merkel of the Christian Democratic Union (CDU)/Christian Social Union (CSU) and Martin Schulz, the Social Democratic Party's (SPD) top candidate, discussed the most important issues in a debate that lasted more than 90 min. Around 16.5 million viewers watched the debate, which was broadcast live on five TV stations.

Sample

Given the lack of experience with virtualized RTR field studies and recruitment strategies (exceptions being Boydstun et al., 2014; Wagschal et al., 2017), it is difficult to predict whether virtualized RTR can keep the promise of providing diverse and large N data.
To address this issue without having to fall back on commercial samples for financial reasons, we relied on extensive media cooperation for recruitment.1 This way, we were able to attract around 45,000 visitors to our website, with around 28,000 completing the pre-debate survey and entering the RTR module. Finally, nearly 15,000 users completed the whole process by remaining active in the app and filling out the post-debate survey. Since recruitment was based on an "open access" strategy to increase participation, some users may have arrived late, left early, or only provided partial data. In the following, we therefore focus only on participants whose data structure comes closest to that of a regular laboratory setting. Specifically, we selected participants who had logged in and completed the pre-survey before the debate began (the login was opened 1 hr before the debate), filled out both pre- and post-surveys, did not take excessive time for completion (i.e., filled out the survey by 1:00 a.m.), resided in Germany (according to their IP address, which after the debate was replaced by an anonymous individual identifier to ensure data protection2), whose TV set had a stable playout delay (i.e., no web streams), and whose rating behavior suggested sincere, human participants.3 Since some respondents may have participated together and (e.g., by talking) thus influenced each other's votes, there exists the possibility that some units may be interdependent (also shifting the sampling unit to the household level), both of which could bias estimators. To circumvent this possibility, we also excluded individuals with a common IP identifier.4 While this does not completely rule out that some dependent cases might still remain (e.g., from individuals using their cell phones' data connections, resulting in different IP addresses), we cannot further identify them based on their IP address or their survey answers. However, we believe their number to be fairly small since it is very common for household members to share one WiFi connection, implying a shared address. Using a mobile data connection instead would require a user to actively refrain from using such a facility and to rely on a potentially slower and costlier connection. Since the question of dependence among units also touches upon potentially relevant household effects, it certainly warrants further attention, and we will return to it more extensively in the conclusion at the end of the article. Our selection left us with a total of 5,660 participants5 whose demographic structure broadly follows known patterns found in online samples and reflects the general audience of our media partners by over-representing younger, male participants with a high level of education and a broad interest in politics (a detailed breakdown is available in Supplementary Table S1). However, it is also clear that, even without specific efforts to reach distinct population groups, all relevant demographic groups are contained in the sample. Concerning regional coverage, our results are encouraging, as the share of participants from the different states closely follows the real distribution pattern (the only exception being Baden-Württemberg, whose over-representation stems from the project location). Also, the distribution of voting preferences comes fairly close to the distribution found in standard telephone surveys.
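The selection described above amounts to a sequence of simple filters on the participant table. The following sketch illustrates one way such a pipeline could look; all file and column names (for example, login_time, stream_delay_stable, ip_id) and the debate start time are assumptions for illustration, not the variables of the actual Debat-O-Meter database.

```python
# Illustrative filtering pipeline for the field sample; column names are assumed.
import pandas as pd

DEBATE_START = pd.Timestamp("2017-09-03 20:15")  # assumed local start time
CUTOFF = pd.Timestamp("2017-09-04 01:00")        # post-survey deadline (1:00 a.m.)

users = pd.read_csv("participants.csv",
                    parse_dates=["login_time", "post_finished_at"])

mask = (
    (users["login_time"] < DEBATE_START)          # logged in before the debate began
    & (users["pre_done"] == 1)                    # completed the pre-debate survey
    & (users["post_done"] == 1)                   # completed the post-debate survey
    & (users["post_finished_at"] <= CUTOFF)       # finished by 1:00 a.m.
    & (users["country"] == "DE")                  # resided in Germany (by IP)
    & (users["stream_delay_stable"] == 1)         # stable playout delay (no web stream)
    & (users["flagged_bot"] == 0)                 # sincere, human rating behavior
)
sample = users[mask]

# Drop everyone who shares an IP identifier with another participant to avoid
# interdependent (household-level) observations.
sample = sample[~sample.duplicated(subset="ip_id", keep=False)]
print(len(sample), "participants retained")
```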
Device and Data

To investigate users' evaluations of the TV debate, we developed a virtual RTR platform, the "Debat-O-Meter," an interdisciplinary research project of computer scientists and political scientists at the University of Freiburg. The Debat-O-Meter aims to be a full substitute for, and an expansion of, the "classical" (i.e., physical) RTR approach, as it is based on mobile clients such as smartphones, tablets, or laptops. It incorporates an anonymous registration, a tutorial introducing the user interface, the RTR module as the app's core function, and a module to survey viewers immediately before and after the debate. Data collected through the survey or the RTR module were immediately transferred to a server and stored together with the users' pseudonyms and a time stamp. To incentivize users to participate, a final module presented an individualized analysis of their evaluative behavior during the debate, featuring characteristics of a voting advice application. To ensure a high degree of standardization in data collection, we implemented several requirements in the process: first, through the phasic structure of our "virtual laboratory," the consecutive modules of the Debat-O-Meter mirror the classical RTR study design. Second, the instructions given to participants are regarded as a crucial point in RTR studies (Biocca et al., 1994; Maier et al., 2007). Thus, users were required to complete a detailed tutorial including precise instructions with respect to the media stimulus, the input mode, and the number and specification of evaluable items and dimensions, referring back to existing work in the field (Reinemann et al., 2005). The Debat-O-Meter was implemented as a push-button device in reset mode with a 5-point scale ranging from double plus for a very good to double minus for a very bad evaluation. If no button was pressed, a neutral evaluation was inferred. For statistical analysis, the data are recoded to a scale ranging from −2 to +2.

Methods

To assess reliability, we draw on McDonald's omega and Cronbach's alpha as two indicators of internal consistency. We then investigate internal validity in three steps: first, we analyze the net number of votes cast for each politician by all participants. Second, we assess the relation of the RTR signal to respondents' partisanship with the help of Kruskal–Wallis rank sum tests with post hoc multiple comparisons to examine whether all partisan groups can indeed be distinguished from each other. Third, we rely on structural equation modeling to analyze how the RTR signal connects to other attitudes before, during, and after the debate, thereby assessing the criterion validity of our measures.

Results

Reliability

When assessing the reliability of RTR measurement, the main focus in the literature rests on how internally consistent the evaluation is, that is, how well participants' evaluations intercorrelate across the debate. One such approach has been laid out by Papastefanou (2013): its general idea is to interpret the stream of incoming votes as a test–retest scenario in which the RTR signal is taken as a longitudinal measurement that repeatedly asks the same item (i.e., how a politician is currently evaluated). If the reactions of the participants toward the politicians are captured reliably, their evaluations should not vary strongly, at least not in the short run.
Thus, substantial correlations across the different quasi-items are taken as a sign that the measurement procedure can be deemed internally consistent. To assess reliability, we divided the debate into slices representing the respective candidates' speaking phases to create the necessary quasi-items (each phase ended 5 s after the person had ceased to speak to allow slower votes to accumulate) and then separately summed up the positive and negative votes a participant had cast within a given interval for the respective politician. From these four participant-by-item tables, we then calculated both McDonald's omega and Cronbach's alpha. All in all, we ended up with 45 speaking phases for Angela Merkel and 49 for Martin Schulz.6 Following Bachl (2014, p. 109), we depart somewhat from the approach taken by Papastefanou (2013), who divided the debates into regular slices of 30 s each. We did so because we know from prior studies that participants almost exclusively cast votes when a candidate is speaking, testifying to the high amount of attention devoted to verbal information (Nagel et al., 2012). Since we count the absence of an explicit reaction as a neutral evaluation, simply relying on regular slices as items would, for a given candidate, introduce arbitrary phases in which the whole audience reacts similarly to the candidate (because he or she is not speaking), thereby artificially increasing the intercorrelation of items across participants. Furthermore, we rely on two indicators of internal consistency, since alpha has recently come under criticism in the literature due to the (usually untenable) underlying assumption that all items load similarly onto the measured construct (Dunn, Baguley, & Brunsden, 2014). Omega, in turn, imposes no such restriction, allowing item loadings to vary (Kelley & Pornprasertmanit, 2016). As such, focusing on omega alone would have sufficed to assess reliability in a more concise manner. Yet we opted to calculate alpha as well (including normal-theory bootstrap confidence intervals as described in Padilla, Divers, & Newton, 2012), since an interesting implication of the different measurement models underlying the two is that omega should score higher than alpha if factor loadings indeed vary across time. Since our items refer to a single measurement instruction, loadings that vary over time would signal that participants were at times more or less attentive to it. By implication, any systematic difference between alpha and omega carries information on how well participants heeded the measurement instructions during the debate.

As shown in Table 1, McDonald's omega and Cronbach's alpha generally suggest a high reliability of the RTR signal with regard to all participants. Usually, omega remains above 0.95 and, in all cases, it is above 0.90. A similar pattern can be observed for alpha, whose lowest value is 0.893, which is still well within the range generally accepted as highly reliable. Furthermore, the differences between the two indicators are substantively negligible, suggesting that factor loadings are indeed constant, meaning that our participants generally adhered to the measurement instructions as the debate progressed.

Table 1. McDonald's Omega and Cronbach's Alpha Overall and by Ostensibly Partisan Evaluation Behavior (Including 95% Confidence Intervals)

                            Merkel positive        Merkel negative        Schulz positive        Schulz negative
McDonald's omega (95% CI)
  All participants          0.974 (0.973, 0.975)   0.952 (0.950, 0.954)   0.962 (0.960, 0.963)   0.948 (0.946, 0.950)
  Partisan                  0.980 (0.979, 0.981)   0.962 (0.959, 0.964)   0.970 (0.968, 0.972)   0.955 (0.952, 0.958)
  Non-partisan              0.944 (0.941, 0.946)   0.912 (0.908, 0.916)   0.934 (0.930, 0.937)   0.904 (0.900, 0.909)
Cronbach's alpha (95% CI)
  All participants          0.963 (0.957, 0.969)   0.943 (0.935, 0.952)   0.946 (0.941, 0.951)   0.936 (0.927, 0.945)
  Partisan                  0.969 (0.964, 0.975)   0.952 (0.944, 0.960)   0.954 (0.949, 0.960)   0.942 (0.934, 0.951)
  Non-partisan              0.933 (0.928, 0.937)   0.904 (0.896, 0.913)   0.920 (0.915, 0.925)   0.893 (0.883, 0.903)
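To make the construction of the quasi-items more tangible, the following sketch builds one of the four participant-by-item tables from a hypothetical long-format vote log and computes Cronbach's alpha on it. Phase boundaries, file names, and column names are assumptions; McDonald's omega and the bootstrap confidence intervals reported above would additionally require a one-factor measurement model, which is not shown here.

```python
# Sketch: participant-by-item table of (e.g.) positive votes for one candidate,
# with speaking phases as quasi-items, and Cronbach's alpha computed on it.
import pandas as pd

votes = pd.read_csv("votes_merkel_positive.csv")    # columns: participant, second
phases = pd.read_csv("merkel_speaking_phases.csv")  # columns: start, end (incl. 5 s padding)

# Assign each vote to the speaking phase (quasi-item) it falls into.
bins = pd.IntervalIndex.from_arrays(phases["start"], phases["end"], closed="left")
votes["item"] = pd.cut(votes["second"], bins)

# Vote counts per participant and phase; phases without a vote count as 0.
# (Participants without any vote of this type are omitted in this sketch.)
table = (votes.dropna(subset=["item"])
              .groupby(["participant", "item"], observed=True).size()
              .unstack(fill_value=0))

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Alpha = k/(k-1) * (1 - sum of item variances / variance of the sum score)."""
    k = items.shape[1]
    item_var = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_var / total_var)

print("alpha:", round(cronbach_alpha(table), 3))
```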
Minor differences appear when we single out participants who did not cast at least one positive and one negative vote for each politician (suggesting a biased evaluation behavior) and whose number of votes during the debate exceeded +2 SD of the distribution (suggesting comparatively intensive evaluative behavior). These individuals score slightly higher in terms of both omega and alpha than the remaining participants. We found a total of 1,915 such individuals. Split up along these lines, omega and alpha distribute as shown in the respective lower rows of Table 1.
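A sketch of how such a split could be derived from the vote log is given below. The data layout is again hypothetical, and the vote-count threshold is read here as the mean plus two standard deviations.

```python
# Sketch: flag "ostensibly partisan" evaluation behavior as described above.
# Columns (participant, candidate, sign) and the exact threshold are assumptions.
import pandas as pd

votes = pd.read_csv("all_votes.csv")   # columns: participant, candidate, sign (+1 or -1)

# Did a participant use both vote directions for every candidate?
directions = (votes.groupby(["participant", "candidate"])["sign"]
                   .nunique().unstack(fill_value=0))
balanced = (directions == 2).all(axis=1)

# Comparatively intensive voting: total vote count above mean + 2 SD.
counts = votes.groupby("participant").size()
intensive = counts > counts.mean() + 2 * counts.std()

# Both criteria together mark the "partisan" rows of Table 1.
ostensibly_partisan = ~balanced & intensive
print(ostensibly_partisan.sum(), "participants flagged")
```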
As Table 1 shows, reliability is slightly weaker for the less partisan individuals. Yet, as both indicators still remain above the threshold, the drop appears too small to warrant concern. In summary, H1 appears confirmed and it seems safe to answer our first research question in the affirmative: measuring debate perception with an online RTR system can yield a reliable signal.

Internal Validity

Laboratory-based RTR studies possess high internal validity, meaning that it is comparatively easy to connect stimulus and (RTR) response. Their main drawback, however, is the difficulty of generalizing findings beyond the concrete group of participants. For mobile RTR devices, the opposite constellation seems more likely. While the large and diverse number of participants should make it easier to generalize to larger audiences, it is less certain that stimulus and response can be connected with the same confidence as in laboratory settings. In other words, for mobile RTR studies, internal validity is potentially an issue. To assess internal validity, we follow two ideas laid out with respect to laboratory-based studies (Maier et al., 2007; Reinemann et al., 2005). Essentially, these approaches investigate to what extent an RTR signal is correlated with other variables involved in the process of debate reception and candidate evaluation. The first test usually investigates the extent to which the RTR signal is correlated with participants' partisan identification, which is known to reside at the heart of many political evaluation processes and which is taken as the reference concept in the assessment of construct validity (see Maier, 2007). The better the RTR signal is predicted by partisanship, the more it behaves in theoretically plausible ways and the more valid the signal appears to be. The second approach often relies on structural equation modeling to assess criterion validity in particular. It draws on the relation between the RTR signal, candidate evaluation before and after the debate, and expected/perceived candidate debate performance (see Bachl, 2013).

Concerning the relation of the RTR signal to respondents' partisanship, we first calculated the net number of votes cast for each politician for all participants, excluding those who had skipped the question on partisanship or who had mentioned "other" parties. Given the underlying friend-or-foe logic of the social identity that resides at the heart of partisanship (Green et al., 2004), it is straightforward to derive expectations about how the live evaluation should connect to respondents' partisanship: For adherents of the two major parties (CDU/CSU and SPD), we can expect a clear tendency to evaluate one's "own" candidate positively while rejecting the respective opponent. For the two classical coalition partners, a similar logic should lead to positive votes for the traditional "ally" and negative evaluations for the opposing bloc. Thus, adherents of the Free Democratic Party (FDP) should generally view Angela Merkel more positively than Schulz while being less enthusiastic than adherents of the CDU/CSU. A similar pattern should apply to Green partisans with respect to Martin Schulz and adherents of the SPD. Again, relations should be reversed for the respective other side's candidate. This leaves us with adherents of the left-wing party, DIE LINKE, adherents of the right-wing party, Alternative for Germany (AfD), and unattached viewers to account for.
While adherents of the Linke may partly see Angela Merkel in a positive light due to her decisions during the refugee crisis in 2015, the more general pattern should be a favorable evaluation of Schulz, given the common political camp and his perceived openness to a coalition including the Linke. Concerning the AfD, whose formation was largely driven by opposition to Merkel, we expect strong hostility toward her plus a general dislike for Schulz, partly because the candidate is from the left camp and partly because part of the AfD's self-legitimization rests on a critique of the established parties. Unattached participants, in turn, should generally fall outside the friend-or-foe logic of the debate and show a more balanced evaluation.

Table 2 shows the average net evaluations for participants with different partisan attachments (the associated eta-squared values are 0.108 for Merkel and 0.130 for Schulz). For better comparison, we have also calculated the net average across all participants. As we can directly see, the values correspond to our theoretical expectations, as adherents of the CDU/CSU and the SPD show a decidedly positive attitude toward their "own" candidate (50.6 for Merkel and 64.4 for Schulz, respectively) and a clearly negative attitude toward their "enemy" (−10.3 for Schulz vs. −11.0 for Merkel). The same is true for adherents of likely coalition partners: here, the Greens show a weakly negative (i.e., below-average) tendency toward Merkel while reacting clearly positively toward Schulz. The same is visible for FDP adherents, albeit reversed. For the Linke, we can observe a clear preference for Schulz and an (even stronger than expected) negative stance toward Merkel. The most negative reaction can be found among AfD adherents who, as expected, show a strongly negative evaluation of Merkel while being largely neutral toward Schulz. Respondents without an attachment lie fairly close to the distribution for the whole sample, indicating that their evaluation is indeed more open. While ideally we would like to compare these relations to laboratory-based data, such information is unavailable for the 2017 debate at the time of writing. However, similar patterns of support and rejection within and across camps can also be found in parts of the literature, albeit only reported for vote intentions (e.g., Bachl, 2013, p. 144) or aggregated partisan groups (Jansen & Glogger, 2017, pp. 49, 51). They also largely replicate with respect to candidate evaluations, both within our own and in external survey data (scaled to a similar metric and demeaned for easier comparison, see Table 2), further bolstering our confidence in the measurement. Since the evaluation behavior of the respective partisan groups corresponds to our theoretical assumptions, we can confirm H2.

Table 2. Average Net Evaluations for Participants by Partisan Attachment

                                                       Mean     SPD    Greens    CDU     FDP     AfD    Linke   No PID
Mean RTR score
  Merkel                                               17.5   −11.0     13.4    50.6    29.2   −37.9    −2.9    10.2
  Schulz                                               23.0    64.4     37.8   −10.3     9.3    16.4    47.7    26.9
Mean evaluation before debate, demeaned
  Merkel                                              (0.46)  −0.67     0.02    0.85    0.31   −1.42   −0.71   −0.28
  Schulz                                             (−0.09)   1.00     0.37   −0.66   −0.48   −0.37    0.41    0.00
Mean evaluation, Politbarometer (August 29–31, 2017), demeaned
  Merkel                                              (0.89)  −0.22    −0.01    0.60    0.08   −1.48    0.08   −0.25
  Schulz                                              (0.23)   0.78     0.31   −0.32   −0.63   −0.89    0.13   −0.07

Note. Mean RTR scores were calculated by adding positive ("++" = +2; "+" = +1) and negative ("−" = −1; "−−" = −2) votes for each participant across the debate. The resulting individual net scores were then averaged across partisan groups. Mean evaluation before the debate (possible range: +2 to −2) was calculated separately for each partisan group and then demeaned by subtracting the average evaluation across all participants. Mean evaluation from the Politbarometer was calculated accordingly after rescaling the original item (+5 to −5) to a scale from +2 to −2. CDU = Christian Democratic Union; PID = party identification; RTR = real-time response; SPD = Social Democratic Party.
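The net RTR scores in the upper rows of Table 2 can be computed along the lines of the note above. The following sketch shows one way to do so; file and column names are assumptions made for illustration.

```python
# Sketch: net RTR score per participant and candidate, averaged by partisan group.
import pandas as pd

votes = pd.read_csv("all_votes.csv")     # columns: participant, candidate, button
survey = pd.read_csv("pre_survey.csv")   # columns: participant, party_id

weights = {"++": 2, "+": 1, "-": -1, "--": -2}
votes["value"] = votes["button"].map(weights)

net = (votes.groupby(["participant", "candidate"])["value"].sum()
            .rename("net_score").reset_index()
            .merge(survey, on="participant"))

# Average net score by partisan group, as in the upper rows of Table 2.
print(net.pivot_table(index="party_id", columns="candidate",
                      values="net_score", aggfunc="mean").round(1))
```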
To further explore the validity of this relation, we also performed Kruskal–Wallis rank sum tests with post hoc multiple comparisons to determine whether all partisan groups can indeed be distinguished from each other. Both global tests were significant at p < 0.0001, and nearly all groups were distinguishable in the post hoc comparisons at a significance level of 0.05. The only exception was the comparison between participants without a partisanship and adherents of the Greens in the case of Merkel, which is not too surprising given that the party appears open to coalitions with both major parties.
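A sketch of such a test is given below, assuming a table of per-participant net scores with hypothetical file and column names. It uses Bonferroni-corrected pairwise Mann–Whitney tests as one possible post hoc procedure, which is not necessarily the exact procedure applied here.

```python
# Sketch: Kruskal-Wallis test across partisan groups plus Bonferroni-corrected
# pairwise post hoc comparisons for one candidate's net RTR scores.
from itertools import combinations

import pandas as pd
from scipy.stats import kruskal, mannwhitneyu
from statsmodels.stats.multitest import multipletests

net = pd.read_csv("net_scores_merkel.csv")   # columns: participant, party_id, net_score
groups = {pid: g["net_score"].to_numpy() for pid, g in net.groupby("party_id")}

h, p_global = kruskal(*groups.values())
print(f"Kruskal-Wallis: H = {h:.1f}, p = {p_global:.4g}")

pairs = list(combinations(groups, 2))
p_raw = [mannwhitneyu(groups[a], groups[b]).pvalue for a, b in pairs]
reject, p_adj, _, _ = multipletests(p_raw, alpha=0.05, method="bonferroni")
for (a, b), p, sig in zip(pairs, p_adj, reject):
    print(f"{a} vs. {b}: adjusted p = {p:.3f}{' *' if sig else ''}")
```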
For Schulz, the post hoc tests failed to distinguish adherents of the Greens versus the Linke, the Linke versus the SPD, and the AfD versus individuals without a partisanship. The two former findings are comparatively narrow misses and appear reasonable given the common camp and Schulz's openness to coalitions with the Linke; the latter may well stem from the dominant motive of dislike for Merkel among AfD adherents. Considering the valid statistical discrimination of the respective partisan groups, we are thus able to confirm H3. Since partisanship quite robustly predicts real-time evaluations, we can answer RQ2.1 in the affirmative as well.

A final piece of evidence can be generated by examining how the RTR signal connects to other immediate attitudes about the debate. In the literature, a common approach to this question can be found in structural equation models that connect attitudes before, during, and after the debate. Since we assume no latent variables, a path model is sufficient, and our main focus rests on the question of whether known structures from classical laboratory-based studies can be reproduced with our data. As a benchmark for comparison, we can draw on the work by Bachl (2013). In his model, perceived performance during the debate (as captured by the RTR signal) is affected both by prior candidate evaluation and by expected debate performance. In turn, all three variables govern how candidate performance is assessed after the debate, which finally affects overall candidate evaluation. In addition, candidate evaluation after the debate rests on live debate performance and prior candidate evaluation. We have reproduced the path structure of Bachl's model with our data and tried to follow his operationalization as closely as possible to allow for comparison (see p. 177 for his operationalization).7 Again, both models confirm the general picture of valid upstream and downstream relations for the RTR signal. It should be noted that the model for Schulz does not pass the likelihood-ratio test, which seems to be of lesser relevance, though, as the test has been shown to regularly reject models estimated on a large number of cases (Weiber & Mühlhaus, 2014, p. 204). Both the root mean square error of approximation (RMSEA) and the standardized root mean square residual (SRMR), which take model structure into account, as well as all three available comparative fit indices (comparative fit index [CFI], TFI, and normed fit index [NFI]) indicate a good model fit (for Merkel: RMSEA = 0.029 [95% CI: 0.009, 0.054], pClose = 0.908, SRMR = 0.018, CFI = 1.000, TFI = 0.997, NFI = 1.000; for Schulz: RMSEA = 0.015 [95% CI: 0.000, 0.043], pClose = 0.985, SRMR = 0.025, CFI = 1.000, TFI = 0.999, NFI = 1.000). The models presented in Figure 1 show both the standardized coefficients and the associated R² values.

Figure 1. Structural equation model for Angela Merkel and Martin Schulz.

Both models indicate that candidate evaluation before the debate has a substantial influence on who is expected to win: the higher a candidate is esteemed, the more likely a participant is to expect that candidate to decide the debate in his or her favor. The effect of expected performance on perceived performance after the debate is still noticeable but much less pronounced. The same holds true for prior evaluation.
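Because the model contains no latent variables, the recursive path structure just described can be approximated equation by equation. The sketch below does this with ordinary least squares and heteroscedasticity-robust standard errors; variable and file names are assumptions, and the fit statistics reported above come from a full structural equation estimation rather than from this simplified approach.

```python
# Sketch: the recursive path model estimated equation by equation with OLS and
# robust (sandwich) standard errors. Variable names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

# One row per participant: pre-debate evaluation, expected winner, mean RTR
# score, post-debate performance verdict, and post-debate evaluation.
df = pd.read_csv("merkel_path_data.csv")

equations = [
    "perf_expected ~ eval_pre",
    "rtr_mean ~ eval_pre + perf_expected",
    "perf_post ~ eval_pre + perf_expected + rtr_mean",
    "eval_post ~ eval_pre + rtr_mean + perf_post",
]
for eq in equations:
    fit = smf.ols(eq, data=df).fit(cov_type="HC3")
    print(fit.summary().tables[1])
```

Standardized coefficients, as displayed in Figure 1, would additionally require z-standardizing the variables before estimation.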
In both models, immediate debate perception strongly affects how the candidate was seen to have performed in retrospect. As such, political leanings may affect expectations, but they are not able to undo actual perception. Considering that the correlations between real-time evaluations and post-debate verdicts on the candidates' performances are significant and substantial even after controlling for the other variables of debate reception, we can consider criterion validity to be established for our data and thus confirm H4 as well. Interestingly, however, the effect of prior evaluation on real-time perception is substantial, indicating that respondents are strongly inclined to view candidate debate performance through a colored lens. Proceeding to post-debate candidate evaluation, there is a noticeable direct effect of post-debate performance (in turn shaped by the RTR signal), accompanied by a substantial direct effect of how the candidate was evaluated during the discussion. By and large, the relative sizes of the coefficients are comparable to those of similar models in the literature for laboratory-based settings (see Bachl, 2013, p. 184), suggesting that our measurement yields comparable results. Furthermore, the fact that actual debate performance (the RTR signal) exerts a comparatively larger influence on downstream variables in the case of Schulz, together with the lesser effect of prior evaluations on his post-debate performance, nicely fits the idea that his image as challenger was not as solidified as that of incumbent Merkel. We can therefore also answer RQ2.2 in the affirmative.

Conclusion

In this article, we have tried to establish the reliability and validity of virtualized RTR measurements outside the laboratory. While existing work has already shown that, in laboratory settings, physical and virtual devices perform similarly (Metz et al., 2016) and that laboratory and virtual settings generally correlate as expected (Maier, Hampe, et al., 2016), the question remained whether research can also rely on data obtained from home-based measurements on a larger scale. Given the setting and the impossibility of interacting with the participants, this question is crucial when real-time response measurement is extended to bigger, more diverse audiences across larger geographical areas. To address this issue, we assessed the reliability and validity of a virtualized measurement in the light of known evidence from laboratory-based settings. To do so, we have drawn on RTR data gathered in the context of the major TV debate between the two chancellor candidates in the 2017 German general election, Angela Merkel and Martin Schulz. This dataset is the largest collection of RTR data available today. By focusing on participants who fully followed the study protocol, we have singled out those who come closest to laboratory participants to obtain the best grounds for comparison. We are quite confident that mobile RTR can provide diverse and large N data and that recruiting participants through media partnerships is a viable and cost-effective option. For some participant attributes, such as regional coverage, gender, and vote choice, the sample we obtained was already quite satisfactory. Yet we also noticed that the sampling tendency known from laboratory settings, that is, a concentration of individuals who are more interested in politics, more highly educated, and in part younger, was also present in our data.
While the sheer number of individuals in principle allows us to look into details such as "rare" groups and offers the possibility of post-stratification, it still seems important to motivate the remaining groups to join as well in order to further improve data quality. In the end, however, these problems all appear manageable, especially when the financial means to resort to established methods of recruitment, such as telephone surveys or ready-made panels, are in place. Yet even where such funds are available, the fact that all relevant population groups were already featured in our data suggests that established recruitment practices might not be necessary, given a suitable re-weighting scheme. In terms of reliability, we can positively state that our data are highly consistent internally and that this consistency falls within the ranges known from laboratory studies. Accordingly, switching to a virtualized environment does not seem to cause substantial problems in this respect. The same conclusion can be drawn when assessing construct and criterion validity: across all groups, partisanship is associated with RTR evaluations as expected, meaning that it is generally possible to step outside the laboratory without a major risk of compromising internal validity. This connection is so stable that nearly all of the partisan groups could readily be distinguished in post hoc tests. Turning to the up- and downstream relations of live debate perception, we also found that established models for laboratory settings held true for our participants. We therefore conclude that mobile RTR data possess an internal validity comparable to laboratory-based studies. Another important aspect is that our sample is restricted to those individuals who fully followed the experimental protocol; consequently, we cannot say much about those who, for instance, joined late, switched to other channels, lost interest, failed to take the situation seriously, and so on. Yet, given that our results come so close to laboratory-based evidence, we have good reason to be confident that most of the findings should also extend to the rest of our participants. Another issue to consider is that of dependence among units. Within a laboratory setting, it is fairly straightforward not only to select participants individually but also to tell them not to interact, thus providing a standardized environment. At home, the situation is different. Researchers have no direct control over whether respondents watch together and whether they exchange opinions, discuss, or even argue. This potentially introduces patterns of dependence among units that (as a limitation of our current work) we can only partly account for when we measure over the Internet. There seem to be two ways to react to this challenge: one would be to try to increase standardization by, for example, explicitly asking participants not to talk during the debate, or perhaps even to watch alone. Done right, this can potentially yield results coming close to those of laboratory situations and help to extend a known and well-understood setting into the population. Alternatively, one may also try to monitor the context in which the debate was watched and use it as an ex post control or even as an independent variable: Did the respondent watch together with others? Did the respondent talk to these persons? Were they of the same opinion?
This would allow researchers to actively model the effects of interdependence and interaction on the perception of televised debates, something that is currently hardly understood. In this article, we have asked whether it is safe to rely on RTR data gathered in virtualized settings outside the laboratory. In summary, we find that the answer to our question is a quite consistent "yes." RTR data collected with the help of virtualized devices over the Internet appear reliable and valid and behave much the same way as data gathered in a laboratory. The differences that do arise can readily be explained as implications of our existing models and thus do not pose problems of data reliability or validity. Therefore, it seems possible to extend RTR research to large audiences at home without compromising the study objectives. As such, smartphones and other devices with an Internet connection open up a new way to explore how large-scale audiences perceive political debates in the confines of their private homes. While laboratory-based studies still have their advantages and justifications, we have shown that they can well be supplemented with alternative ways of measurement, which allows research to address new and innovative questions. In a sense, the door is open.

Supplementary Data

Supplementary Data are available at IJPOR online.

Biographical Note

Thomas Waldvogel is a research associate at the Department of Political Science, Albert-Ludwigs University of Freiburg, Germany. His research focuses on political communication, election campaigns, and political education. Thomas Metz is a research associate at the Department of Political Science, Albert-Ludwigs University of Freiburg, Germany. His research focuses on election studies, peace and conflict studies, and statistical methods in political science.

Footnotes

1 We recruited over 90 regional and national newspapers and online media, including Süddeutsche Zeitung (circulation 350,000), Focus online (23.7 million users), Redaktionsnetzwerk Deutschland (providing content for about 40 newspapers), and Sat.1 national TV. In exchange for coverage of the project and calls for participation, media partners were offered interviews and articles to support coverage plus a preliminary analysis directly after the debate. Participants received a customized analysis of their evaluative behavior plus a shortened version of the media partners' analysis.

2 The Debat-O-Meter uses the hypertext transfer protocol secure (HTTPS) for privacy protection. IPs were automatically converted to consecutive numbers, and the original addresses were deleted after 24 hr. We also recorded participants' user agent headers for information on the users' devices (PC or mobile).

3 To prevent abuse, the Debat-O-Meter uses measures such as Captchas to distinguish human users from automated scripts and monitors user behavior in real time to spot suspicious activity. All in all, suspicious users made up only a small part of the sample.

4 This choice eliminated 478 individuals sharing 228 IP addresses. Most of these (210 addresses) were shared by two people of comparable age and opposite sex, suggesting couples.

5 The lion's share of the reduction comes from participants outside Germany and users who joined later during the debate.

6 One speaking phase for Merkel had to be discarded because it lasted only one second (a brief "yes" to a minor question) and elicited no reaction from the audience.
7 Candidate evaluation is an overall assessment between −2 (very low) and +2 (very high), and debate performance is measured as the mean RTR evaluation (ranging from −2 to +2). While Bachl used separate items, we measured pre-/postdebate performance by recoding expected/perceived status as debate winner. Respondents who expected/perceived Merkel to win were coded +1, those expecting/perceiving a draw were given a value of 0, and an expected/perceived victory for Schulz was coded as −1. We used the same scheme for Schulz. Given our coding, assumptions for normal distribution could not be sustained, so we estimated all models with robust standard errors. Acknowledgment This article is our own work, but it would not have been possible without support and feedback from the rest of the Debat-O-Meter team, which we greatly appreciate. References Bachl M. ( 2013 ). Die Wahrnehmung des TV-Duells. In Bachl M., Brettschneider F., Ottler S. (Eds.), Das TV-Duell in Baden-Württemberg 2011 (pp. 135 – 169 ). Wiesbaden : Springer VS . Google Scholar Crossref Search ADS Google Preview WorldCat COPAC Bachl M. ( 2014 ). Analyse rezeptionsbegleitend gemessener Kandidatenbewertungen in TV-Duellen. (PhD thesis, Universität Hohenheim). Benoit W. L. , Hansen G. J., Verser R. M. ( 2003 ). A meta-analysis of the effects viewing U.S. presidential debates . Communication Monographs , 70 ( 4 ), 335 – 350 . Google Scholar Crossref Search ADS WorldCat Biocca F. , David P., West M. ( 1994 ). Continuous response measurement (CRM). In Lang A. (Ed.), Measuring psychological responses to media messages (pp. 15 – 64 ). Hillsdale : Erlbaum . Google Scholar Google Preview OpenURL Placeholder Text WorldCat COPAC Bortz J. , Döring N. ( 2006 ). Forschungsmethoden und Evaluation für Human- und Sozialwissenschaftler . Heidelberg : Springer . Google Scholar Crossref Search ADS Google Preview WorldCat COPAC Boyd T. C. , Hughes G. D. ( 1992 ). Validating realtime response measures. In J. F. Jr. Sherry, & B. Sternthal (Eds.), NA - Advances in Consumer Research (Vol. 19, pp. 649–656). Provo, Utah: Association for Consumer Research. Boydstun A. E. , Glazier R. A., Pietryka M. T., Resnik P. ( 2014 ). Real-time reactions to a 2012 presidential debate . Public Opinion Quarterly , 78 , 330 – 343 . Google Scholar Crossref Search ADS WorldCat Campbell A. , Converse P., Miller W., Stokes D. ( 1960 ). The American voter . New York : Wiley . Google Scholar Google Preview OpenURL Placeholder Text WorldCat COPAC Dunn T. , Baguley T., Brunsden V. ( 2014 ). From alpha to omega: A practical solution to the pervasive problem of internal consistency estimation . British Journal of Psychology , 105 ( 3 ), 399 – 412 . Google Scholar Crossref Search ADS PubMed WorldCat Fenwick I. , Rice M. D. ( 1991 ). Reliability of continuous measurement copy-testing methods . Journal of Advertising Research , 31 ( 1 ), 23 – 29 . Google Scholar OpenURL Placeholder Text WorldCat Green D. P. , Palmquist B., Schickler E. ( 2004 ). Partisan hearts and minds: Political parties and the social identities of voters . New Haven : Yale University Press . Google Scholar Google Preview OpenURL Placeholder Text WorldCat COPAC Hallonquist T. , Peatman J. G. ( 1947 ). Diagnosing your radio program, or the program analyzer at work. In Institute for Education by Radio (Ed.), Education on the air. Yearbook of the Institute for Education by Radio (pp. 463 – 474 ). Columbus, OH: Ohio State University Press. Google Scholar Google Preview OpenURL Placeholder Text WorldCat COPAC Hallonquist T. 
Acknowledgment

This article is our own work, but it would not have been possible without the support and feedback of the rest of the Debat-O-Meter team, which we greatly appreciate.

References

Bachl, M. (2013). Die Wahrnehmung des TV-Duells. In M. Bachl, F. Brettschneider, & S. Ottler (Eds.), Das TV-Duell in Baden-Württemberg 2011 (pp. 135–169). Wiesbaden: Springer VS.

Bachl, M. (2014). Analyse rezeptionsbegleitend gemessener Kandidatenbewertungen in TV-Duellen (PhD thesis, Universität Hohenheim).

Benoit, W. L., Hansen, G. J., & Verser, R. M. (2003). A meta-analysis of the effects of viewing U.S. presidential debates. Communication Monographs, 70(4), 335–350.

Biocca, F., David, P., & West, M. (1994). Continuous response measurement (CRM). In A. Lang (Ed.), Measuring psychological responses to media messages (pp. 15–64). Hillsdale: Erlbaum.

Bortz, J., & Döring, N. (2006). Forschungsmethoden und Evaluation für Human- und Sozialwissenschaftler. Heidelberg: Springer.

Boyd, T. C., & Hughes, G. D. (1992). Validating realtime response measures. In J. F. Sherry Jr. & B. Sternthal (Eds.), NA - Advances in Consumer Research (Vol. 19, pp. 649–656). Provo, UT: Association for Consumer Research.

Boydstun, A. E., Glazier, R. A., Pietryka, M. T., & Resnik, P. (2014). Real-time reactions to a 2012 presidential debate. Public Opinion Quarterly, 78, 330–343.

Campbell, A., Converse, P., Miller, W., & Stokes, D. (1960). The American voter. New York: Wiley.

Dunn, T., Baguley, T., & Brunsden, V. (2014). From alpha to omega: A practical solution to the pervasive problem of internal consistency estimation. British Journal of Psychology, 105(3), 399–412.

Fenwick, I., & Rice, M. D. (1991). Reliability of continuous measurement copy-testing methods. Journal of Advertising Research, 31(1), 23–29.

Green, D. P., Palmquist, B., & Schickler, E. (2004). Partisan hearts and minds: Political parties and the social identities of voters. New Haven: Yale University Press.

Hallonquist, T., & Peatman, J. G. (1947). Diagnosing your radio program, or the program analyzer at work. In Institute for Education by Radio (Ed.), Education on the air. Yearbook of the Institute for Education by Radio (pp. 463–474). Columbus, OH: Ohio State University Press.

Hallonquist, T., & Suchmann, E. E. (1944). Listening to the listener. Experiences with the Lazarsfeld-Stanton program analyzer. In P. F. Lazarsfeld & F. Stanton (Eds.), Radio research 1942–1943 (pp. 265–334). New York: Arno Press.

Hughes, G. D., & Lennox, R. (1990). Realtime response research: Construct validation and reliability assessment. In W. Bearden & A. Parasuraman (Eds.), Enhancing knowledge development in marketing (pp. 284–288). Chicago: American Marketing Association.

Jansen, C., & Glogger, I. (2017). Von Schachteln im Schaufenster, Kreisverkehren und (keiner) PKW-Maut. In T. Faas, J. Maier, & M. Maier (Eds.), Merkel gegen Steinbrück (pp. 31–58). Wiesbaden: Springer VS.

Kelley, K., & Pornprasertmanit, S. (2016). Confidence intervals for population reliability coefficients. Psychological Methods, 21(1), 69–92.

Maier, J. (2007). Erfolgreiche Überzeugungsarbeit. Urteile über den Debattensieger und die Veränderung der Kanzlerpräferenz. In M. Maurer, C. Reinemann, J. Maier, & M. Maier (Eds.), Schröder gegen Merkel (pp. 91–109). Wiesbaden: VS Verlag.

Maier, J., Faas, T., & Maier, M. (2014). Aufgeholt, aber nicht aufgeschlossen: Ausgewählte Befunde zur Wahrnehmung und Wirkung des TV-Duells 2013 zwischen Angela Merkel und Peer Steinbrück. Zeitschrift für Parlamentsfragen, 45(1), 38–54.

Maier, J., Hampe, J. F., & Jahn, N. (2016). Breaking out of the lab: Measuring real-time responses to televised political content in real-world settings. Public Opinion Quarterly, 80(2), 542–553.

Maier, J., Maurer, M., Reinemann, C., & Faas, T. (2007). Reliability and validity of real-time response measurement: A comparison of two studies of a televised debate in Germany. International Journal of Public Opinion Research, 19(1), 53–73.

Maier, J., Rittberger, B., & Faas, T. (2016). Debating Europe: Effects of the “Eurovision Debate” on EU attitudes of young German voters and the moderating role played by political involvement. Politics and Governance, 4(1), 55–68.

Maurer, M., Reinemann, C., Maier, J., & Maier, M. (Eds.). (2007). Schröder gegen Merkel. Wiesbaden: VS Verlag.

McKinney, M., & Carlin, D. (2004). Political campaign debates. In L. L. Kaid (Ed.), Handbook of political communication research (pp. 203–234). Mahwah: Lawrence Erlbaum.

Metz, T., Wagschal, U., Waldvogel, T., Bachl, M., Feiten, L., & Becker, B. (2016). Das Debat-O-Meter. Zeitschrift für Staats- und Europawissenschaften, 14(1), 124–149.

Nagel, F., Maurer, M., & Reinemann, C. (2012). Is there a visual dominance in political communication? Journal of Communication, 62(5), 833–850.

Padilla, M., Divers, J., & Newton, M. (2012). Coefficient alpha bootstrap confidence interval under nonnormality. Applied Psychological Measurement, 36(5), 331–348.
Papastefanou, G. (2013). Reliability and validity of RTR measurement device (GESIS Working Paper 2013-27). GESIS – Leibniz-Institut für Sozialwissenschaften.

Reinemann, C., Maier, J., Faas, T., & Maurer, M. (2005). Reliabilität und Validität von RTR-Messungen. Publizistik, 50(1), 56–73.

Schill, D., & Kirk, R. (2014). Courting the swing voter: “Real time” insights into the 2008 and 2012 U.S. presidential debates. American Behavioral Scientist, 58(4), 536–555.

Schill, D., Kirk, R., & Jasperson, A. E. (2017). Political communication in real time. New York: Routledge.

Schwerin, H. (1940). An exploratory study of the reliability of the “program analyzer”. Journal of Applied Psychology, 24(6), 742–745.

Wagschal, U., Waldvogel, T., Metz, T., Becker, B., Feiten, L., Weishaupt, S., & Singh, K. (2017). Das TV-Duell und die Landtagswahl in Schleswig-Holstein. Zeitschrift für Parlamentsfragen, 48(3), 594–613.

Weiber, R., & Mühlhaus, D. (2014). Strukturgleichungsmodellierung. Heidelberg: Springer.

© The Author(s) 2020. Published by Oxford University Press on behalf of The World Association for Public Opinion Research. All rights reserved.
