Nonresponse and Measurement Error Variance among Interviewers in Standardized and Conversational Interviewing

Nonresponse and Measurement Error Variance among Interviewers in Standardized and Conversational... Abstract Recent methodological studies have attempted to decompose the interviewer variance introduced in interviewer-administered surveys into its potential sources, using the Total Survey Error framework. These studies have informed the literature on interviewer effects by acknowledging interviewers’ dual roles as recruiters and data collectors, thus examining the relative contributions of nonresponse error variance and measurement error variance among interviewers to total interviewer variance. However, this breakdown may depend on the interviewing technique: some techniques emphasize behaviors designed to reduce variation in the answers collected by interviewers more so than other techniques. The question of whether the contributions of these error sources to total interviewer variance change for different interviewing techniques remains unanswered. Addressing this gap in knowledge has important implications for interviewing practice because the technique used could alter the relative contributions of variance in these error sources to total interviewer variance. This article presents results from an experimental study mounted in Germany that was designed to answer this question about two specific interviewing techniques. A national sample of employed individuals was first selected from a database of official administrative records, then randomly assigned to interviewers who themselves were randomized to conduct either conversational interviewing (CI) or standardized interviewing (SI), and finally measured face-to-face on a variety of cognitively challenging survey questions with official values also available for verifying the accuracy of responses. We find that although nonresponse error variance does exist among interviewers for selected measures (especially respondent age in the CI group), measurement error variance tends to be the more important source of total interviewer variance, regardless of whether interviewers are using CI or SI. 1. INTRODUCTION Interviewer variance continues to be a vexing problem for survey researchers. Survey estimates suffer from reduced quality when interviewer-specific estimates vary despite the assignment of cases with similar features to different interviewers. This variability reduces effective sample sizes in a manner similar to cluster sampling (Elliott and West 2015; Schnell and Kreuter 2005; Groves 2004; O’Muircheartaigh and Campanelli 1998), increasing the variance of estimates given fixed costs of data collection. More specifically, one can define a multiplicative “interviewer effect” on the variance of an estimator of the mean for a particular survey item as 1+(m¯−1)ρint⁡. In this well-known expression, m¯ corresponds to the average number of interviews completed across all interviewers, and ρint⁡ is the intra-interviewer correlation (IIC) of the responses to the survey item. The IIC is defined as the ratio of the between-interviewer variance in the responses to the sum of the between- and within-interviewer variance in the responses (i.e., the total variance in the responses). Thus, given an estimated IIC of 0.02 and an average of thirty interviews completed per interviewer, one would expect the variance of the estimated mean to be inflated by 58%. The development of strategies for minimizing interviewer variance requires the execution of carefully designed studies that isolate the effects of interviewers on survey measures of interest and identify the reasons for these effects. The implementation of these types of studies, however, is quite difficult in practice. Random assignment of sampled cases to interviewers (i.e., interpenetrated sampling), which is required for estimation of interviewer effects on a given survey outcome, is often not feasible for cost reasons. This introduces the possibility of variance among interviewers in the features of sampled cases prior to any recruiting or measurement efforts. Despite the extra effort and costs, researchers have been able to randomly assign cases to interviewers (or isolate interviewer effects analytically) in several studies, and the effects of interviewers on both response rates and survey measurement are now well documented (Schaeffer, Dykema, Maynard 2010; West and Blom, 2017). Several studies have suggested that interviewer variance in respondent reports may actually be arising from nonresponse error variance among interviewers, or the recruitment of individuals with different features by different interviewers (Stock and Hochstim 1951; Moser and Stuart 1953; Tucker 1983; Stokes and Yeh 1988). More recently, methodological studies have attempted to take advantage of the information in sampling frames about both respondents and nonrespondents, and decompose the variability among interviewers into three sources: 1) sampling error variance, introduced by a lack of interpenetrated assignment; 2) nonresponse error variance, or the variance among interviewers in the true values of the cases that they successfully recruit; and 3) measurement error variance, or the additional variance among interviewers introduced by the deviations of the responses that they collect from their true values (West and Olson 2010; West, Kreuter, and Jaenichen 2013). Although the literature has paid far more attention to measurement than nonresponse as the origin of interviewer variance, these recent decomposition studies have indicated that interviewer variance in respondent reports can in fact arise from nonresponse error variance. This decomposition of interviewer variance may also depend on the interviewing technique used. Standardized interviewing (SI) emphasizes behaviors designed to reduce variation between interviewers in the responses they collect (e.g., Fowler and Mangione 1990), while conversational interviewing (CI) encourages interviewers to say what is necessary to ensure that respondents understand survey questions as they are intended (e.g., Schober and Conrad 1997). No studies have performed comparative decompositions of the interviewer variance introduced by these two techniques. In theory, standardized interviewers provide all respondents with the exact same question stimuli and respond to respondents’ questions, concerns, and confusion with neutral probes (e.g., “let me repeat the question”). This technique standardizes question administration, and while interviewers are known to deviate from its strict principles in real-world settings (Peneff 1988; Houtkoop-Steenstra 1995; Schaeffer etal. 2010; Haan, Ongena, and Huiskes 2013; Ackermann-Piek and Massing 2014; see also Suchman and Jordan 1990), it theoretically removes the interviewer as a potential source of variable error during the measurement process, reducing the probability of interviewer effects due to the way questions are asked (Henson, Cannell, and Lawson 1976; Groves and Magilavy 1986; Fowler and Mangione 1990; Mangione, Fowler, and Louis 1992). Any measurement error variance among interviewers in SI would therefore be expected to arise from non-verbal behaviors, behaviors that deviate from SI protocol (e.g., excessive probing), or the effects of demographic characteristics of the interviewer during measurement (West and Blom 2017). Conversational interviewers, on the other hand, initially read questions exactly as worded and are then granted the flexibility to say whatever is necessary to ensure respondent comprehension if respondents misunderstand questions or express confusion or uncertainty about the questions. While CI can increase administration time compared to SI1, it has repeatedly been shown to increase the accuracy of responses to factual questions relative to SI (Schober and Conrad 1997; Conrad and Schober 2000; Schober, Conrad, and Fricker 2004; Schober, Conrad, Dijkstra, and Ongena 2012; Hubbard, Antoun, and Conrad 2012). In theory, one would therefore expect that CI will not introduce interviewer variance in factual questions either, provided that the interviewers are successful in consistently achieving accurate measurements. A recent study (West, Conrad, Kreuter and Mittereder, 2017) found that CI rarely increased overall interviewer variance relative to SI for a variety of survey items related to employment history and housing conditions and that these increases did not offset gains in the overall quality of estimates due to the increased response accuracy engendered by CI. Nevertheless, any interviewer variance introduced by either technique could still compromise the quality of selected survey estimates, making knowledge of whether it results from errors during recruitment (nonresponse) or interviewing (measurement) key to improving survey quality. No studies have addressed this gap in knowledge. A related interviewing approach that has been found to improve the accuracy of respondent recall in multiple studies is event history calendar (EHC) interviewing (e.g., Belli, Lee, Stafford, and Chou 2004; Sayles, Belli, and Serrano 2010). When survey respondents are asked to recall potentially complex autobiographical measures, an interviewer employing the EHC approach uses a flexible, conversational style to encourage the respondent to employ narrative use of retrieval cues available in autobiographical memory. Several studies have compared data collected using the EHC approach to data collected using a more standardized interviewing approach, and these studies have repeatedly demonstrated the benefits of the EHC approach for response accuracy (Belli 1998; Belli, Shay, and Stafford 2001; Belli etal. 2004; Sayles etal. 2010). Only one study to date has specifically considered the interviewer variance introduced by this more flexible approach relative to a more standardized approach (Sayles etal. 2010), finding evidence of modest increases in interviewer variance due to the use of the EHC approach. Much like the CI literature, no studies to date have considered the sources of any interviewer variance introduced by the EHC approach. Although techniques like CI and the EHC approach are designed to reduce measurement error, it is possible that the additional flexibility granted to interviewers using these approaches could result in measurement error variance accounting for a larger portion of the total interviewer variance introduced by CI, relative to SI. This may be due to uneven implementation of the techniques across interviewers; some interviewers may go off on tangents, and others may present respondents with incorrect definitions of key concepts or make erroneous statements when attempting to address respondent confusion. Hypotheses related to the expected contributions of nonresponse error variance for these two techniques, on the other hand, are harder to formulate given the lack of prior work in this area. One possibility is that some of the conversational interviewers who are trained to say what is required to ensure that respondents understand questions as intended during measurement may also improvise (tailor) more when securing respondent cooperation, thus introducing interviewer variance in nonresponse rates (e.g., Groves, Cialdini, and Couper 1992; Morton-Williams 1993; Campanelli, Sturgis, and Purdon 1997; Sturgis and Campanelli 1998; Snijkers, Hox, and de Leeuw 1999; Groves and McGonagle 2001). For example, some conversational interviewers may emphasize certain features of the survey design (e.g., incentives, timing, etc.) more so than others when tailoring their recruitment efforts to respondent concerns. While this differential emphasis may have the potential to mitigate the negative leverage of specific design features on the decision to participate for certain respondents, following the ideas of leverage-saliency theory (Groves, Singer, and Corning 2000), it could lead to nonresponse error variance among interviewers if the features emphasized by some conversational interviewers tend to be strongly linked to key survey measures. For instance, a conversational interviewer who tends to emphasize the incentives when convincing people to participate may end up only recruiting persons with lower socio-economic status who are not as interested in the topic, and these individuals may respond on key measures quite differently from individuals recruited by another interviewer emphasizing different design features. On the other hand, one could also argue that higher rates of tailoring among conversational interviewers could lead to more systematically successful recruitment, and interviewers using a more standardized recruitment approach (which may work well for some sampled cases but not others) may vary more in terms of achieved response rates. Regardless of the (unknown) effects of these different measurement techniques on recruitment practices, whether variability in response rates across interviewers leads to increased interviewer variance in nonresponse errors is also currently unknown (Groves 2006; Groves and Peytcheva 2008). The concept of “liking” could also lead to nonresponse error variance, regardless of whether CI or SI is used. Interviewers may tend to recruit respondents with similar socio-demographic features. For instance, younger interviewers may tend to recruit younger respondents (West and Blom 2017). This variance among interviewers in the types of respondents recruited could introduce nonresponse error variance if these socio-demographic features are correlated with the survey measures of interest. It may be the case that during recruitment, more conversational interviewers tend to draw more attention to their socio-demographic features while tailoring and maintaining engagement, but this is presently unknown. Initial work is needed to understand the extent to which nonresponse error variance exists for either interviewing technique and whether observable features of the interviewers play a role in introducing this type of variance. To date, these alternative hypotheses have not been tested for the CI and SI techniques. If CI and SI vary in terms of the relative contributions of nonresponse error variance and measurement error variance to total interviewer variance, then deployment of one technique or the other will likely require more technique-specific training than is now common in order to reduce both types of error. Researchers could eventually design follow-up studies to test the effects of this type of training on the measurement and nonresponse error variance introduced by a particular technique, but an essential first step is the quantification and comparison of the contributions of nonresponse error variance and measurement error variance to the total interviewer variance introduced by the CI and SI techniques. We base these comparisons on data collected from a large national sample in Germany, where interviewers were randomly assigned to use either CI or SI. The randomly sampled persons assigned to each interviewer were measured on a variety of survey items, some of which had validation information available on the sampling frame. We first identify items with significant interviewer variance overall and then decompose this variance into the sources mentioned above. 2. METHODS 2.1 Sample Size Considerations Given our specific objectives, we wanted to design a study that would have enough power to detect differences in the variance components used to compute IICs between standardized and conversational interviewers as being significant. We therefore began by performing a customized Monte Carlo simulation study to assess the power of alternative designs to detect a range of differences in variance components (see the SAS code in the online supplementary material). A review of the literature on interviewer effects (West and Olson 2010; Schnell and Kreuter 2005; O’Muircheartaigh and Campanelli 1998) suggests that most IICs will range from 0.01 to 0.12 in face-to-face surveys, with many falling below 0.02. Furthermore, in our recent work analyzing data on survey items from a face-to-face study in Germany (West, Kreuter, and Jaenichen 2013) with subject matter similar to that considered in the present study (employment histories), we found that these IICs ranged from below 0.01 to approximately 0.09. An IIC of 0.09 would quadruple the variance of an estimate, reducing the effective sample size by 75%. We used these earlier results to perform power calculations for this study, ensuring that we would have enough power to detect differences of this magnitude for both continuous and binary survey measures of employment history.2 In the simulation studies, we found that having thirty interviewers measuring a continuous item for each of the two techniques (sixty interviewers total) and thirty respondents for each interviewer (or 1,800 respondents in total) would yield approximately 80% power to detect a 6.6-fold difference in the between-interviewer variance components used to compute the IIC as significant, based on a likelihood ratio test (West and Elliott 2014), with a 5% level of significance. A difference of this size in the variance components falls within the aforementioned range of IICs (0.01 to 0.09) that we have seen in related studies with similar subject matter. Furthermore, we found that having thirty interviewers in each of the two experimental groups measuring a binary item and thirty respondents for each interviewer would yield approximately 82% power to detect a similar difference in the between-interviewer variance components, again using a likelihood ratio test with a 5% level of significance. We therefore based our data collection protocol on meeting these targets. 2.2 Sampling and Data Collection Next, we designed and conducted an in-person data collection in fifteen large areas in Germany. In each area, we selected a simple random sample of 480 currently employed adults who had experienced at least one period of unemployment (or “unemployment spell”) from the Integrated Employment Biographies (IEB) database, which contains official government information on employment histories.3 The overall sample size was thus n = 15 × 480 = 7,200, allowing for a 25% response rate given the aforementioned power analysis. Four professional interviewers were assigned to work in each of the fifteen areas (sixty interviewers total), and each was assigned 120 of the randomly sampled cases in total (i.e., interpenetrated sample assignment, conditional on the area). We note that the design of this study differs from most other studies of interviewer effects where interviewers work in multiple areas, in that interviewers were nested within the fifteen areas in Germany. More recent work has suggested that designs allowing interviewers to work in different areas introduce important sample size and sample assignment considerations (Vassallo, Durrant, and Smith 2016; e.g., how many areas should be worked by an interviewer for interpenetration to hold?), but this was not a feature of our design. We therefore conducted the simulation study described above, and the SAS code in the online supplementary material can be readily applied for similar calculations. Additional details related to the sampling used in this study can be found in West, Conrad, Kreuter, Mittereder (2017). Two randomly selected interviewers in each area were thoroughly trained in the use of CI, and the other two were similarly trained in the use of SI. Both groups of interviewers were trained to administer a 30-minute instrument in a face-to-face setting. This instrument included several questions that met at least one of the following four criteria: 1) the question had been tested and used in previous national surveys in Germany; 2) answering the question was judged to require complex cognitive processes, mainly related to employment history; 3) the question had historically required a large amount of interviewer-respondent interaction, according to German researchers; and 4) official administrative values were available in the IEB database for validating responses. We also included attitudinal items without available validation data to study the effects of the interviewing techniques on these responses. The full questionnaire, translated into English, is available at http://doku.iab.de/fragebogen/CIIV_Questionnaire_English_08052015.pdf. Data collection continued from March to October in 2014. Prior to the onset of data collection, an advance letter was mailed to the sampled persons indicating the general topic of the survey (“Working and Living in Germany”), stressing the importance of the study for describing this population, and promising the sampled persons a twenty-euro token of appreciation for participating. The advance letter had the exact same content for both CI and SI. We were unable to record the recruitment conversations given the resources available for this study and, therefore, could not determine whether interviewers gave equal weight to each of these study features during recruitment; different interviewers giving different weight to different features could potentially introduce nonresponse error variance. In total, n = 1,850 interviews were completed by the sixty interviewers (AAPOR RR1 = 25.7%), resulting in an average of roughly thirty-one interviews per interviewer (minimum = 5, maximum = 59). Our data collection therefore hit the aforementioned target of 1,800 respondents. The aforementioned West etal. (2017) study presented initial analyses of these data. This study sought to compare the overall IICs and response distributions for a total of fifty-five variables generated by this data collection, including the respondent reports on each survey item and selected survey paradata (e.g., interview duration or interviewer observations of response quality), between the CI and SI groups of interviewers and determine whether there were any consistent differences in the overall IICs (based on respondent measures). This study found that CI generally produced higher IICs (see figure 1), but substantial differences tended to be rare, with significantly higher IICs emerging for CI for only five of the items. For the one item of these five with administrative data available from the IEB database, the increase in the IIC was not found to produce an estimate with a higher estimated mean squared error (MSE) relative to SI. CI was also found to shift response distributions significantly and in the direction of higher data quality for fourteen of the fifty-five items, reflecting the general benefits of CI for response quality that have been reported previously in the literature. Importantly, this initial study primarily considered respondent reports only, and the present study sought to use the official IEB data on the sampling frame to study the decompositions of any observed interviewer variance in respondent reports for each interviewing technique. Figure 1. View largeDownload slide Scatter Plot of the Estimated IICs in Respondent Reports for the SI and CI Groups, for All 55 Variables Analyzed in West etal. (2017). Each point represents a unique variable, and the variable with the highest IICs in each group (upper-right corner of figure 1) was interview duration. Figure 1. View largeDownload slide Scatter Plot of the Estimated IICs in Respondent Reports for the SI and CI Groups, for All 55 Variables Analyzed in West etal. (2017). Each point represents a unique variable, and the variable with the highest IICs in each group (upper-right corner of figure 1) was interview duration. Especially relevant to the present study were eleven survey items with responses that could be validated using administrative information on the IEB sampling frame. These eleven items are listed below, in addition to our expectations regarding the primary sources of interviewer variance in respondent reports on each item based on prior research. We note that we cannot generate expectations about the primary sources of interviewer variance for each technique given the lack of prior work in this area, resulting in the more general expectations below: Respondent birth year (expectation: nonresponse error variance among interviewers, given the simplicity of the measure; West and Olson 2010; West etal. 2013) Any unemployment since January 1, 2013 (expectation: a mixture of nonresponse error variance and measurement error variance among interviewers; West etal. 2013) Months of employment since January 1, 2013 (expectation: measurement error variance among interviewers, given possible recall difficulty and differential assistance of interviewers with recall tasks; Bruckmeier, Müller, and Riphahn 2015; Sayles etal. 2010) Months of unemployment since January 1, 2013 (expectation: measurement error variance; Bruckmeier etal. 2015; Sayles etal. 2010) Most recent gross monthly income (zero if unemployed since January 1, 2013, and natural-log transformed) (expectation: a mixture of nonresponse error variance and measurement error variance, given the complexity of the measure; Bailar, Bailey, and Stevens 1977) Gross annual income in 2013 (zero if unemployed since January 1, 2013, and natural-log transformed) (expectation: a larger contribution of measurement error variance than for most recent monthly income, given all possible sources of income over the course of a year; Bailar etal. 1977) Number of marginal (“mini”) jobs, defined by short-term employment on a fixed contract that does not exceed two months or more than fifty days in a calendar year, in the past five years (expectation: measurement error variance, given the complex definition of a “mini” job and the length of the recall period; Sayles etal. 2010) Months employed in the last twenty years (expectation: measurement error variance, given the length of the recall period and opportunities for interviewers to assist; Sayles etal. 2010) Longest uninterrupted period of employment in the last twenty years (in months) (expectation: same as above) Number of times registered as unemployed in the last twenty years (expectation: same as above) Months unemployed in the last twenty years (expectation: same as above) In the present study, we first identified which of these variables presented evidence of significant interviewer variance in respondent reports. We then analyzed the IEB data (referred to as “record values”) to understand the primary sources of this variance and whether they were consistent with the expectations outlined above. 2.3 Interviewer Evaluation To enable more in-depth analyses of the implementation of SI and CI in the field, the interviewers asked persons agreeing to participate in the survey for permission to record the entire interview. In total, 69.9% of all respondents agreed to have the entire interview recorded. All of the recordings that we analyzed contained the respondent’s affirmative consent to be recorded and did not include any identifying information. The field supervisors provided the interviewers with bi-weekly feedback on their performance based on these recordings and reported that the performance of selected interviewers improved (i.e., the interviewers became better at using their assigned technique) after receiving this feedback. Overall, when analyzing a subsample of these recordings in detail for interviewers in each of the two groups, we found that the CI and SI techniques were implemented correctly and consistently (Mittereder, Durow, West, Kreuter, and Conrad forthcoming). 2.4 Analytic Approach For each of the eleven questions with available validation data on the sampling frame and for each technique, we applied the multi-step methodology used in previous decomposition studies by West and Olson (2010) and West etal. (2013). In each step, we fitted a multilevel model (e.g., multilevel logistic models for questions with two possible responses) appropriate for the type of question that included fixed area effects (treating the areas as fixed characteristics of the interviewers, given that hypothetical replications of this study would use the same fifteen areas) and random interviewer effects (capturing effects unique to the interviewers and not the areas). The variances of the random interviewer effects were allowed to vary depending on the interviewing technique (CI or SI). This makes the analytic approach used for this study unique relative to previous decomposition studies, in that we can consider technique-specific decompositions of the interviewer variance. The four modeling steps were as follows: Estimate the interviewer variance in the means of record values for the randomly assigned sample cases (i.e., sampling error variance, expected to be negligible). Estimate the interviewer variance in the means of record values for respondents, considering unit nonresponse only (i.e., nonresponse error variance, assuming no sampling error variance). Estimate the interviewer variance in the means of record values for respondents, considering both unit and item nonresponse. Estimate the total interviewer variance in the means of reported values for respondents (i.e., a combination of measurement error variance and nonresponse error variance). Given successful interpenetrated assignment, we would expect the IIC for each technique in step 1) to be zero. The difference in the estimates between steps 4) and 3) for a given technique represents an estimate of the measurement error variance introduced by interviewers, assuming a negligible covariance between nonresponse error and measurement error among interviewers (West and Olson 2010). Given the interpenetrated design used in this study (conditional on the areas), this four-step methodology enables us to compare the variance components among interviewers arising from sampling, unit nonresponse, item nonresponse, and measurement for each interviewing technique. This methodology also allows us to test whether the variance components at each step (for each technique) are greater than zero using appropriate likelihood ratio tests (Zhang and Lin 2008). The fitted models also facilitated comparisons of the interviewer variance components arising at a given step between the two techniques. For all binary dependent variables yij, the models used had the form defined in (1) below: ln⁡(P(yij=1)P(yij=0))=β0+β1I[CIi=1]+∑p=215βpI[AREAi=p]+u1iI[CIi=1]+u2iI[SIi=1]u1i∼N(0,τCI2), u2i∼N(0,τSI2) (1) In (1), i denotes an interviewer and j a respondent, and the model includes a fixed intercept, a fixed effect of the CI technique, fourteen fixed area effects, and random interviewer effects specific to a given technique (u1i for CI and u2i for SI). We remind readers that the interviewers were, by design, not allowed to work in multiple randomly sampled areas, so we did not employ a cross-classified multilevel model that would be appropriate for these types of crossed designs (Vassallo etal. 2016). We also note that the variance of the random interviewer effects at each step is allowed to vary depending on the technique. The two interviewer components of variance for a given step were formally compared using the likelihood ratio testing approach outlined by West and Elliott (2014), enabling us to determine whether one technique resulted in more interviewer variability at a given step. Models of the form in (1) were estimated using adaptive Gauss-Hermite quadrature, as implemented in Stata’s melogit command (Version 14.1). Intra-interviewer correlations (IICs) were computed for each technique based on the underlying logistic distribution; for example, ρint⁡,CI=τCI2τCI2+π2/3. For all continuous measures, models of the form in (2) were fitted at each step using restricted maximum likelihood estifmation, as implemented in the mixed command in Stata: yij=β0+β1I[CIi=1]+∑p=215βpI[AREAi=p]+u1iI[CIi=1]+u2iI[SIi=1]+ɛiju1i∼N(0,τCI2), u2i∼N(0,τSI2), ɛij∼N(0,σCI2) if CIi=1, ɛij∼N(0,σSI2) if SIi=1 (2) We note that the model in (2) also allows for a unique residual variance for each technique as well, enabling the estimation of technique-specific IICs (e.g., ρint⁡,CI=τCI2τCI2+σCI2). We provide examples of the Stata code used to fit the models and the resulting output in the online supplementary material. 3 RESULTS For three of the eleven questions, we found evidence of at least marginally significant interviewer variance (P < 0.10) in respondent reports for one of the techniques. Less stringent criteria for significance has been suggested previously (e.g., P < 0.25 in Kish 1962), but we sought to identify items subject to substantial interviewer variance, given our objective of producing a reliable decomposition of that variance. The existence of this variance for three of these items is notable; if null hypotheses that interviewer variance components were equal to zero held for each of the eleven questions, we would have expected to see significant interviewer variance for only one of the eleven questions by random chance alone when using likelihood ratio tests with a 5% significance level. The three survey questions included respondent birth year (CI: IIC = 0.025, P = 0.041), 2013 gross annual income (SI: IIC = 0.021, P = 0.072), and longest uninterrupted period of employment in the past twenty years (CI: IIC = 0.066, P < 0.001). We therefore focus on decompositions of the interviewer variance for these three survey questions.4 First considering respondent birth year, table 1 shows the relative decompositions of the interviewer variance for each technique, in addition to tests of significance comparing the variance components at each step. Each row of table 1 presents the estimated interviewer variance components for SI (column 2) and CI (column 3) for the values analyzed at each specific step of the decomposition process (noted in Column 1). So, for example, we see in row 2 of table 1 that when analyzing the record values of respondents and considering unit nonresponse only, there is no evidence of interviewer variance in the SI group, but there is significant evidence of interviewer variance in the CI group (estimated interviewer variance component = 3.130, likelihood ratio test P < 0.05). Table 1. Decomposition of the Total Interviewer Variance in Respondent Reports of Birth Years for SI and CI (see figure 2) SI CI Test of difference in variance components Estimated interviewer variance component, record values of full sample (n = 7,199) < 0.01 < 0.01 NS Estimated interviewer variance component, record values of respondents, unit nonresponse only (n = 1,850) < 0.01 3.130** p = 0.143 Estimated interviewer variance component, record values of respondents, unit and item nonresponse (n = 1,849) < 0.01 3.122** p = 0.141 Estimated interviewer variance component, reported values of respondents (n = 1,849) < 0.01 3.218** p = 0.137 Estimated intra-interviewer correlation (IIC) in respondent reports < 0.001 0.025 Main source of overall interviewer variance in respondent reports None Nonresponse error variance SI CI Test of difference in variance components Estimated interviewer variance component, record values of full sample (n = 7,199) < 0.01 < 0.01 NS Estimated interviewer variance component, record values of respondents, unit nonresponse only (n = 1,850) < 0.01 3.130** p = 0.143 Estimated interviewer variance component, record values of respondents, unit and item nonresponse (n = 1,849) < 0.01 3.122** p = 0.141 Estimated interviewer variance component, reported values of respondents (n = 1,849) < 0.01 3.218** p = 0.137 Estimated intra-interviewer correlation (IIC) in respondent reports < 0.001 0.025 Main source of overall interviewer variance in respondent reports None Nonresponse error variance Note.—**p < 0.05, based on a likelihood ratio test; NS = not significant. Table 1. Decomposition of the Total Interviewer Variance in Respondent Reports of Birth Years for SI and CI (see figure 2) SI CI Test of difference in variance components Estimated interviewer variance component, record values of full sample (n = 7,199) < 0.01 < 0.01 NS Estimated interviewer variance component, record values of respondents, unit nonresponse only (n = 1,850) < 0.01 3.130** p = 0.143 Estimated interviewer variance component, record values of respondents, unit and item nonresponse (n = 1,849) < 0.01 3.122** p = 0.141 Estimated interviewer variance component, reported values of respondents (n = 1,849) < 0.01 3.218** p = 0.137 Estimated intra-interviewer correlation (IIC) in respondent reports < 0.001 0.025 Main source of overall interviewer variance in respondent reports None Nonresponse error variance SI CI Test of difference in variance components Estimated interviewer variance component, record values of full sample (n = 7,199) < 0.01 < 0.01 NS Estimated interviewer variance component, record values of respondents, unit nonresponse only (n = 1,850) < 0.01 3.130** p = 0.143 Estimated interviewer variance component, record values of respondents, unit and item nonresponse (n = 1,849) < 0.01 3.122** p = 0.141 Estimated interviewer variance component, reported values of respondents (n = 1,849) < 0.01 3.218** p = 0.137 Estimated intra-interviewer correlation (IIC) in respondent reports < 0.001 0.025 Main source of overall interviewer variance in respondent reports None Nonresponse error variance Note.—**p < 0.05, based on a likelihood ratio test; NS = not significant. Table 1 therefore presents no evidence of variance among the interviewers in either group in terms of mean birth year based on the full sample assignments, as expected based on the interpenetrated design. However, in the CI group, there is evidence of a substantial increase in the interviewer variance in mean birth years (based on the administrative data) when considering respondents only, suggesting significant nonresponse error variance (P = 0.045). This result hardly changes when including the one additional respondent that refused to provide their birth year in the survey. This variance then persists in the CI group when considering interviewer variance in the respondent reports, where measurement error variance would not be expected for a simple factual question like birth year. These results clearly show that the overall interviewer variance in mean birth years in the CI group was arising from nonresponse error variance, consistent with previous research (West and Olson 2010, West etal. 2013). This decomposition is illustrated in figure 2. Figure 2. View largeDownload slide Box Plots of Predicted Random Interviewer Effects (EBLUPs) at Each Stage of the Data Collection (Where an Interviewer with a Mean Equal to the Overall Mean Would Have an EBLUP of 0), Illustrating the Increased Interviewer Variance in Respondent Birth Years that Was Introduced in the CI Group (Top Row) at the Recruitment Stage (Based on Unit Nonresponse). Figure 2. View largeDownload slide Box Plots of Predicted Random Interviewer Effects (EBLUPs) at Each Stage of the Data Collection (Where an Interviewer with a Mean Equal to the Overall Mean Would Have an EBLUP of 0), Illustrating the Increased Interviewer Variance in Respondent Birth Years that Was Introduced in the CI Group (Top Row) at the Recruitment Stage (Based on Unit Nonresponse). Next, we consider the decomposition of interviewer variance for 2013 gross annual income (log-transformed) in table 2. Table 2 demonstrates that sampling error variance and nonresponse error variance (based on either unit nonresponse only or both unit and item nonresponse) were not issues in either of the groups, and that measurement error variance was the primary source of the marginally significant interviewer variance in the SI group. This finding, illustrated in figure 3, is largely consistent with the results presented by Bailar etal. (1977). Table 2. Decomposition of the Total Interviewer Variance in Log-Transformed Reports of Gross Annual income in 2013 for SI and CI (see figure 3) SI CI Test of difference in variance components Estimated interviewer variance component, record values of full sample (n = 7,154) < 0.01 < 0.01 NS Estimated interviewer variance component, record values of respondents, unit nonresponse only (n = 1,846) < 0.01 < 0.01 NS Estimated interviewer variance component, Record values of respondents, unit and item nonresponse (n = 1,479) < 0.01 < 0.01 NS Estimated interviewer variance component, reported values of respondents (n = 1,479) 0.152* < 0.01 NS Estimated intra-interviewer correlation (IIC) in respondent reports 0.021 < 0.001 Main source of overall interviewer variance in respondent reports Measurement error variance None SI CI Test of difference in variance components Estimated interviewer variance component, record values of full sample (n = 7,154) < 0.01 < 0.01 NS Estimated interviewer variance component, record values of respondents, unit nonresponse only (n = 1,846) < 0.01 < 0.01 NS Estimated interviewer variance component, Record values of respondents, unit and item nonresponse (n = 1,479) < 0.01 < 0.01 NS Estimated interviewer variance component, reported values of respondents (n = 1,479) 0.152* < 0.01 NS Estimated intra-interviewer correlation (IIC) in respondent reports 0.021 < 0.001 Main source of overall interviewer variance in respondent reports Measurement error variance None Note.—*p < 0.10, based on a likelihood ratio test; NS = not significant. Table 2. Decomposition of the Total Interviewer Variance in Log-Transformed Reports of Gross Annual income in 2013 for SI and CI (see figure 3) SI CI Test of difference in variance components Estimated interviewer variance component, record values of full sample (n = 7,154) < 0.01 < 0.01 NS Estimated interviewer variance component, record values of respondents, unit nonresponse only (n = 1,846) < 0.01 < 0.01 NS Estimated interviewer variance component, Record values of respondents, unit and item nonresponse (n = 1,479) < 0.01 < 0.01 NS Estimated interviewer variance component, reported values of respondents (n = 1,479) 0.152* < 0.01 NS Estimated intra-interviewer correlation (IIC) in respondent reports 0.021 < 0.001 Main source of overall interviewer variance in respondent reports Measurement error variance None SI CI Test of difference in variance components Estimated interviewer variance component, record values of full sample (n = 7,154) < 0.01 < 0.01 NS Estimated interviewer variance component, record values of respondents, unit nonresponse only (n = 1,846) < 0.01 < 0.01 NS Estimated interviewer variance component, Record values of respondents, unit and item nonresponse (n = 1,479) < 0.01 < 0.01 NS Estimated interviewer variance component, reported values of respondents (n = 1,479) 0.152* < 0.01 NS Estimated intra-interviewer correlation (IIC) in respondent reports 0.021 < 0.001 Main source of overall interviewer variance in respondent reports Measurement error variance None Note.—*p < 0.10, based on a likelihood ratio test; NS = not significant. Figure 3. View largeDownload slide Box Plots of Predicted Random Interviewer Effects (EBLUPs) at Each Stage of the Data Collection (Where an Interviewer with a Mean Equal to the Overall Mean Would Have an EBLUP of 0), Illustrating the Increased Interviewer Variance in 2013 Gross Annual Income Values (Log-Transformed) that Was Introduced in the SI Group (Bottom Row) at the Measurement Stage. Figure 3. View largeDownload slide Box Plots of Predicted Random Interviewer Effects (EBLUPs) at Each Stage of the Data Collection (Where an Interviewer with a Mean Equal to the Overall Mean Would Have an EBLUP of 0), Illustrating the Increased Interviewer Variance in 2013 Gross Annual Income Values (Log-Transformed) that Was Introduced in the SI Group (Bottom Row) at the Measurement Stage. Finally, table 3 provides the same decomposition results for the longest uninterrupted period of employment in the past twenty years. Table 3 presents evidence of nonresponse error variance in the SI group, given evidence of successful interpenetration for each technique. The estimated interviewer variance component increased substantially when considering record values for respondents only, ultimately increasing by more than 1200% (P = 0.059) when also including the seventeen respondents who did not answer this question on the survey (with the added item nonresponse increasing the interviewer variance slightly). These increases were offset by the eventual respondent reports, which tended to be closer to the overall mean. (We elaborate on this finding in the discussion section below.) In the CI group, on the other hand, there was negligible evidence of interviewer variance based on the record values of the recruited respondents. However, the interviewer variance in the CI group increased substantially at the measurement stage, to the point where the estimated IIC was 0.066 and the total interviewer variance was significant at the 1% level. Further, the difference in the total interviewer variance between the two techniques was also significant at the 1% level. This finding is largely consistent with the results of Sayles etal. (2010), who suggested that EHC approaches may increase interviewer variance in a modest fashion for recall items. Figure 4 illustrates these patterns in interviewer variance for each technique. Table 3. Decomposition of the Total Interviewer Variance in Reports of Longest Uninterrupted Period of Employment in the Past Twenty Years for the SI and CI Groups (see figure 4) SI CI Test of difference in variance components Estimated interviewer variance component, record values of full sample (n = 7,199) 3.313 3.349 NS Estimated interviewer variance component, record values of respondents, unit nonresponse only (n = 1,850) 40.555* < 0.01 p = 0.159 Estimated interviewer variance component, Record Values of respondents, unit and item nonresponse (n = 1,833) 44.207* < 0.01 p = 0.135 Estimated interviewer variance component, reported values of respondents (n = 1,833) < 0.01 194.999*** p < 0.01 Estimated intra-interviewer correlation (IIC) in respondent reports < 0.001 0.066 Main source of overall interviewer variance in respondent reports None Measurement error variance SI CI Test of difference in variance components Estimated interviewer variance component, record values of full sample (n = 7,199) 3.313 3.349 NS Estimated interviewer variance component, record values of respondents, unit nonresponse only (n = 1,850) 40.555* < 0.01 p = 0.159 Estimated interviewer variance component, Record Values of respondents, unit and item nonresponse (n = 1,833) 44.207* < 0.01 p = 0.135 Estimated interviewer variance component, reported values of respondents (n = 1,833) < 0.01 194.999*** p < 0.01 Estimated intra-interviewer correlation (IIC) in respondent reports < 0.001 0.066 Main source of overall interviewer variance in respondent reports None Measurement error variance Note.—*p < 0.10, ***p < 0.01, based on a likelihood ratio test; NS = not significant. Table 3. Decomposition of the Total Interviewer Variance in Reports of Longest Uninterrupted Period of Employment in the Past Twenty Years for the SI and CI Groups (see figure 4) SI CI Test of difference in variance components Estimated interviewer variance component, record values of full sample (n = 7,199) 3.313 3.349 NS Estimated interviewer variance component, record values of respondents, unit nonresponse only (n = 1,850) 40.555* < 0.01 p = 0.159 Estimated interviewer variance component, Record Values of respondents, unit and item nonresponse (n = 1,833) 44.207* < 0.01 p = 0.135 Estimated interviewer variance component, reported values of respondents (n = 1,833) < 0.01 194.999*** p < 0.01 Estimated intra-interviewer correlation (IIC) in respondent reports < 0.001 0.066 Main source of overall interviewer variance in respondent reports None Measurement error variance SI CI Test of difference in variance components Estimated interviewer variance component, record values of full sample (n = 7,199) 3.313 3.349 NS Estimated interviewer variance component, record values of respondents, unit nonresponse only (n = 1,850) 40.555* < 0.01 p = 0.159 Estimated interviewer variance component, Record Values of respondents, unit and item nonresponse (n = 1,833) 44.207* < 0.01 p = 0.135 Estimated interviewer variance component, reported values of respondents (n = 1,833) < 0.01 194.999*** p < 0.01 Estimated intra-interviewer correlation (IIC) in respondent reports < 0.001 0.066 Main source of overall interviewer variance in respondent reports None Measurement error variance Note.—*p < 0.10, ***p < 0.01, based on a likelihood ratio test; NS = not significant. Figure 4. View largeDownload slide Box Plots of Predicted Random Interviewer Effects (EBLUPs) at Each Stage of the Data Collection (Where an Interviewer with a Mean Equal to the Overall Mean Would Have an EBLUP of 0), Illustrating the Increased Interviewer Variance in the Longest Uninterrupted Period of Continuous Employment in the Past Twenty Years that Was Introduced in the CI Group (Top Row) at the Measurement Stage. Figure 4. View largeDownload slide Box Plots of Predicted Random Interviewer Effects (EBLUPs) at Each Stage of the Data Collection (Where an Interviewer with a Mean Equal to the Overall Mean Would Have an EBLUP of 0), Illustrating the Increased Interviewer Variance in the Longest Uninterrupted Period of Continuous Employment in the Past Twenty Years that Was Introduced in the CI Group (Top Row) at the Measurement Stage. In summary, given evidence of successful interpenetrated sampling, we found evidence of: Substantial nonresponse error variance among conversational interviewers in the ages of recruited respondents (figure 2); Marginal evidence of measurement error variance among standardized interviewers for reports of annual income (figure 3); Marginal evidence of nonresponse error variance among standardized interviewers for reports of longest uninterrupted employment periods in the past twenty years (which was ultimately reduced at the measurement stage; see figure 4); and Substantial measurement error variance among conversational interviewers in terms of the longest uninterrupted employment periods (figure 4). We also found no evidence of item nonresponse contributing to the overall nonresponse error variance among interviewers. 4. DISCUSSION 4.1 Summary of Results and Suggestions for Practice The results of this study provide empirical support for two main conclusions: Measurement error variance was an important contributor to the total interviewer variance for each interviewing technique, not just CI. Nonresponse error variance among interviewers also occurs for each technique, and in some cases, increases in interviewer variance components at the recruitment stage may ultimately be “offset” by respondent reports that are closer to the overall mean. We found evidence of significant interviewer variance for three out of eleven questions with available administrative data, with measurement error variance tending to be a larger source of the total interviewer variance for two of the three questions. These questions included gross annual income (in 2013) and reports of the longest uninterrupted period of employment in the past twenty years. Interviewers may struggle with correctly communicating the meaning of essential concepts underlying the reporting of gross annual income, including the role of bonuses, overtime payments, etc., and SI may limit the ability of interviewers to correctly explain these ideas. In questions that require a respondent to process a long history of events, less structured approaches like EHC interviewing (e.g., Sayles etal. 2010) may ultimately provide higher quality responses on average. However, these less structured approaches also introduce the possibility of increased interviewer variance if they are applied in an uneven fashion across interviewers. Our results suggest that conversational interviewers may have been taking different approaches to helping respondents think about this history, and some approaches may have ultimately been misleading. Finally, given the clear evidence of nonresponse error variance among interviewers in respondent age found here, several studies have now suggested that interviewers may recruit respondents with different ages (West and Olson 2010; West etal. 2013). This has critical implications for surveys collecting measures that are highly correlated with age, and we discuss these further below. The remaining eight measures that did not present evidence of significant interviewer variance at any step of the analysis generally required respondents to think about more recent events (e.g., employment since January 1, 2013) or historical events that may have been easier to remember because they had a larger impact on a person’s life than simple employment (e.g., number of times registered as unemployed in the last twenty years). It could be the case that these survey questions were somewhat “easier” cognitively and did not introduce as many opportunities for the interviewers to affect the measurement process. In light of these results, strategies for reducing measurement error variance deserve continued attention and consideration. For example, interviews could be audio recorded (with respondent consent), and the recordings of specific interviewers with respondent means that deviate significantly from the overall respondent means in models for particular survey questions5 (see figures 3 and 4, for example) could be analyzed to understand what those interviewers are doing differently. While we monitored audio recordings of completed interviews to ensure that our interviewers were implementing their assigned techniques correctly, this was a resource-intensive process, and we were only able to fully code the interactions in a small sample of completed interviews (Mittereder etal., forthcoming). Consistent analysis and monitoring of the interviewer effects associated with particular survey items would enable survey managers to identify those interviewers producing extreme deviations, and the coding of the audio recordings could then be much more focused on the problematic interviewers and the survey items in question. One-on-one training could then be implemented to correct any issues discovered in the recordings (e.g., failing to probe appropriately or using incorrect definitions of the concepts being measured). Importantly, the nonresponse error variance found in this study was not specific to a given interviewing technique. This is not entirely surprising; while we entertained the possibility that interviewers trained to use conversational interviewing might also vary in the types of individuals that they recruit, one would not generally expect the measurement technique that an organization trains an interviewer to use to affect recruitment outcomes. However, when we consider this finding alongside similar findings in prior studies of this type (West and Olson 2010; West etal. 2013), nonresponse error variance among interviewers does seem to be a general phenomenon that needs careful monitoring (especially in terms of the ages of recruited individuals). Different interviewers may emphasize different aspects of a study design (incentive, topic, sponsor, etc.) during recruitment, and this may be another source of nonresponse error variance, regardless of the interviewing technique being employed. We could not test this here, but this is certainly an open question for future research. Furthermore, the “cancelling out” of nonresponse error variance by respondent reports demonstrated in this study may not be ideal from a data quality perspective: if the mean respondent reports for different interviewers are quite similar but systematically different from the true mean, interviewer variance may be reduced, but the overall survey estimate may be biased.6 In addition, variable errors in the respondent reports could have a negative effect on the quality of estimates of regression coefficients. If the longest uninterrupted period of employment measure was used as a predictor variable in a regression model for some other economic outcome of interest, the absence of apparent interviewer variance in the respondent reports in the SI condition would not tell the whole story about the measurement problems for this variable. Given the evidence of nonresponse error variance in the SI condition, if respondents are providing reports with variable error that tend to be similar to the overall respondent mean and reduce the nonresponse error variance, these variable errors could attenuate estimates of regression coefficients toward zero. In the absence of record values on a sampling frame, survey managers could monitor the variability among interviewers, prior to measurement, in the values of “relevant” auxiliary variables and paradata (i.e., auxiliary information known to be correlated with the key variables) for recruited cases, and possibly intervene if particular interviewers seem to have unusual distributions. We note that this is a different strategy from simply intervening with interviewers who have low response rates. For example, if a survey organization linked the ages of household heads from a commercial database to sampled addresses (e.g., West, Wagner, Gu, and Hubbard 2015), and age was an important correlate of several key items in a given survey (e.g., self-rated health), interviewer variance in the ages of recruited respondents could be analyzed, and managers could intervene if particular interviewers seem to be primarily recruiting persons of a certain age. 4.2 Future Research Directions We first note that the decomposition approach used in this study relied on strong assumptions (e.g., interviewer-specific measurement errors and nonresponse errors are independent). Future research needs to focus on methods for decomposing interviewer variance that require fewer assumptions and allow these error sources to covary. West and Elliott (2015) have proposed one such idea based on multilevel modeling and multiple imputation, and future work in this area needs to further consider more elegant modeling approaches that relax these assumptions. Future studies should also continue to explore the “liking” hypothesis (e.g., older interviewers tend to recruit older respondents, which has important training implications) and examine whether interviewer characteristics effectively account for any observed nonresponse error variance. We only had access to a limited number of interviewer-level covariates in this study [prior experience with the PASS study (yes/no), a related study in Germany collecting similar types of measures (see Trappmann, Beste, Bethmann, and Müller 2013), age, and gender]. The randomized design of this study ensured that these characteristics were balanced between the two interviewing techniques. We added fixed effects of these covariates (with age centered at the mean of 57.26) to the four models fitted to each of the three variables analyzed in this study and found no evidence of the fixed effects being significantly different from zero in any of the models. Many of our interviewers were older and male, which limited the predictive utility of these variables, and several interviewers had the previous experience indicated above. While “liking” was not apparent here, it has been found in other studies (West and Blom 2017) and needs continued consideration. Finally, additional replications of this study would also be welcome for further understanding this problem. The results of this study are specific to one population (employed persons in Germany with complex employment histories), one questionnaire (mainly on employment history), and one data collection organization (infas). This prevents one from predicting whether a replication of this study in a different population and/or context would produce similar findings, motivating future replications. The specific population that we surveyed may weigh the survey participation decision and the topic of our survey quite differently than other populations (e.g., unemployed persons); this population also likely has unique features in terms of age (younger) and education (higher) relative to other populations. These differences have important implications for the nonresponse and measurement errors studied here, in terms of the recruitment process (where timing and the ages of the interviewers may be important factors), the topic of our survey, and the complexity of the questions being asked. We also learned during interviewer training that CI is essentially the norm for face-to-face surveys in Germany, and the interviewers actually found the SI training to be more difficult and unconventional. Interviewers in other countries (such as the United States) who may not be as comfortable with using CI may therefore apply this technique differently, possibly introducing more variance at one of the stages analyzed here, but this is difficult to predict in the absence of any replications. Additional replications of this study in other populations and/or contexts would help to build a set of empirical evidence regarding the interviewer variance introduced by these two techniques across different cultures and survey topics. Footnotes 1. In general, the more clarification that conversational interviewers provide, the longer the interviews take (Schober, Conrad, and Fricker 2004). 2. The overall survey included additional items that were not related to employment history, but we focus on these types of items in this study given the available validation data on the sampling frame. 3. For more information on the IEB database, please visit http://fdz.iab.de/en/FDZ_Individual_Data/Integrated_Employment_Biographies.aspx 4. Estimates of the fixed effects in the final models are available in the online supplementary materials. 5. One could identify interviewers with extreme deviations by computing predicted values of the random interviewer effects (Empirical Best Linear Unbiased Predictors, or EBLUPs) based on the estimated parameters in multilevel models fitted to the survey reports and testing whether the individual random effects are significantly different from zero. This approach is implemented, for example, in PROC MIXED of the SAS/STAT software, by adding the SOLUTION option to the RANDOM statement. 6. Additional analyses reported in West etal. (2017) suggest that this was not the case in this study. The “cancelling out” demonstrated here therefore suggests that interviewers with extreme means based on recruited individuals simply collected responses closer to the overall mean. REFERENCES Ackermann-Piek D. , Massing N. ( 2014 ), “ Interviewer Behavior and Interviewer Characteristics in PIAAC Germany ,” Methods, Data, Analyses , 8 , 199 – 222 . Bailar B. , Bailey L. , Stevens J. ( 1977 ), “ Measures of Interviewer Bias and Variance ,” Journal of Marketing Research , 14 , 337 – 343 . Google Scholar CrossRef Search ADS Belli R. F. ( 1998 ), “ The Structure of Autobiographical Memory and the Event History Calendar: Potential Improvements in the Quality of Retrospective Reports in Surveys ,” Memory , 6 , 383 – 406 . Google Scholar CrossRef Search ADS PubMed Belli R. F. , Lee E. H. , Stafford F. P. , Chou C.-H. ( 2004 ), “ Calendar and Question-list Survey Methods: Association between Interviewer Behaviors and Data Quality ,” Journal of Official Statistics , 20 , 185 – 218 . Belli R. F. , Shay W. L. , Stafford F. P. ( 2001 ), “ Event History Calendars and Question List Surveys: A Direct Comparison of Interviewing Methods ,” Public Opinion Quarterly , 65 , 45 – 74 . Google Scholar CrossRef Search ADS PubMed Bruckmeier K. , Müller G. , Riphahn R. T. ( 2015 ), “ Survey Misreporting of Welfare Receipt: Respondent, Interviewer, and Interview Characteristics ,” Economics Letters , 129 , 103 – 107 . Google Scholar CrossRef Search ADS Campanelli P. , Sturgis P. , Purdon S. ( 1997 ), Can You Hear Me Knocking: An Investigation into the Impact of Interviewers on Survey Response Rates , London : SCPR Survey Methods Centre . Conrad F. G. , Schober M. F. ( 2000 ), “ Clarifying Question Meaning in a Household Telephone Survey ,” Public Opinion Quarterly , 64 , 1 – 28 . Google Scholar CrossRef Search ADS PubMed Elliott M. R. , West B. T. ( 2015 ), “ Clustering by Interviewer’: A Source of Variance That Is Unaccounted for in Single-Stage Health Surveys ,” American Journal of Epidemiology , 182 , 118 – 126 . Google Scholar CrossRef Search ADS PubMed Fowler F. J. , Mangione T. W. ( 1990 ), Standardized Survey Interviewing: Minimizing Interviewer-Related Error , Newbury Park, CA : SAGE Publications . Google Scholar CrossRef Search ADS Groves R. M. ( 2004 ), “Chapter 8: The interviewer as a source of survey measurement error,” Survey Errors and Survey Costs (2nd ed.) , New York : Wiley-Interscience . Groves R. M. ( 2006 ), “ Nonresponse Rates and Nonresponse Bias in Household Surveys ,” Public Opinion Quarterly , 70 , 646 – 675 . Google Scholar CrossRef Search ADS Groves R. M. , Magilavy L. J. ( 1986 ), “ Measuring and Explaining Interviewer Effects in Centralized Telephone Surveys ,” Public Opinion Quarterly , 50 , 251 – 266 . Google Scholar CrossRef Search ADS Groves R. M. , McGonagle K. A. ( 2001 ), “ A Theory-Guided Interviewer Training Protocol Regarding Survey Participation ,” Journal of Official Statistics , 17 , 249 – 265 . Groves R. M. , Cialdini R. B. , Couper M. P. ( 1992 ), “ Understanding the Decision to Participate in a Survey ,” Public Opinion Quarterly , 56 , 475 – 495 . Google Scholar CrossRef Search ADS Groves R. M. , Peytcheva E. ( 2008 ), “ The Impact of Nonresponse Rates on Nonresponse Bias: A Meta-Analysis ,” Public Opinion Quarterly , 72 , 167 – 189 . Google Scholar CrossRef Search ADS Groves R. M. , Singer E. , Corning A. ( 2000 ), “ Leverage-Saliency Theory of Survey Participation ,” Public Opinion Quarterly , 64 , 299 – 308 . Google Scholar CrossRef Search ADS PubMed Haan M. , Ongena Y. , Huiskes M. ( 2013 ), “Interviewers’ Questions: Rewording Not Always a Bad Thing,” In: Interviewers' Deviations in Surveys: Impact, Reasons, Detection and Prevention , (edited by Winker P. , Menold N. , Porst R. ), Frankfurt : Peter Lang Academic Research . Henson R. , Cannell C. F. , Lawson S. ( 1976 ), “ Effects of Interviewer Style on Quality of Reporting in a Survey Interview ,” Journal of Psychology , 93 , 221 – 227 . Google Scholar CrossRef Search ADS Houtkoop-Steenstra H. ( 1995 ), “Meeting Both Ends: Between Standardization and Recipient Design in Telephone Survey Interviews,” Situated Order: Studies in the Social Organization of Talk and Embodied Activities , ed. ten Have P. , Psathas G. , pp. 91 – 107, Washington, DC : University Press of America . Hubbard F. , Antoun C. , Conrad F. G. ( 2012 ), “Conversational Interviewing, The Comprehension of Opinion Questions and Nonverbal Sensitivity,” Paper presented at the Annual Conference of the American Association for Public Opinion Research, Orlando, FL. Kish L. ( 1962 ), “ Studies of Interviewer Variance for Attitudinal Variables ,” Journal of the American Statistical Association , 57 , 92 – 115 . Google Scholar CrossRef Search ADS Mangione T. W. , Fowler F. J. , Louis T. A. ( 1992 ), “ Question Characteristics and Interviewer Effects ,” Journal of Official Statistics , 8 , 293 – 307 . Mittereder F. , Durow J. , West B. T. , Kreuter F. , Conrad F. G. (forthcoming), “ Interviewer-Respondent Interactions in Conversational and Standardized Interviewing ,” Field Methods . Morton-Williams J. ( 1993 ), Interviewer Approaches , Aldershot : Dartmouth Publishing Company Limited . Moser C. A. , Stuart A. ( 1953 ), “ An Experimental Study of Quota Sampling ,” Journal of the Royal Statistical Society, Series A , 116 , 349 – 405 . Google Scholar CrossRef Search ADS O’Muircheartaigh C. , Campanelli P. ( 1998 ), “ The Relative Impact of Interviewer Effects and Sample Design Effects on Survey Precision ,” Journal of the Royal Statistical Society, Series A , 161 , 63 – 77 . Google Scholar CrossRef Search ADS Peneff J. ( 1988 ), “ The Observers Observed: French Survey Researchers at Work ,” Social Problems , 35 , 520 – 535 . Google Scholar CrossRef Search ADS Sayles H. , Belli R. F. , Serrano E. ( 2010 ), “ Interviewer Variance Between Event History Calendar and Conventional Questionnaire Interviews ,” Public Opinion Quarterly , 74 , 140 – 153 . Google Scholar CrossRef Search ADS Schaeffer N. C. , Dykema J. , Maynard D. W. ( 2010 ), “Interviewers and Interviewing,” Handbook of Survey Research , ( 2nd ed .), ed. Wright James D. , Marsden Peter V. , pp. 437 – 470 , Bingley, UK : Emerald Group Publishing Limited . Schnell R. , Kreuter F. ( 2005 ), “ Separating Interviewer and Sampling-Point Effects ,” Journal of Official Statistics , 21 , 389 – 410 . Schober M. F. , Conrad F. G. ( 1997 ), “ Does Conversational Interviewing Reduce Survey Measurement Error? ,” Public Opinion Quarterly , 61 , 576 – 602 . Google Scholar CrossRef Search ADS Schober M. F. , Conrad F. G. , Dijkstra W. , Ongena Y. P. ( 2012 ), “ Disfluencies and Gaze Aversion in Unreliable Responses to Survey Questions ,” Journal of Official Statistics , 28 , 555 – 582 . Schober M. F. , Conrad F. G. , Fricker S. S. ( 2004 ), “ Misunderstanding Standardized Language in Research Interviews ,” Applied Cognitive Psychology , 18 , 169 – 188 . Google Scholar CrossRef Search ADS Snijkers G. , Hox J. J. , de Leeuw E. D. ( 1999 ), “ Interviewers' Tactics for Fighting Survey Nonresponse ,” Journal of Official Statistics , 15 , 185 – 198 . Stock J. S. , Hochstim J. R. ( 1951 ), “ A Method of Measuring Interviewer Variability ,” Public Opinion Quarterly , 15 , 322 – 334 . Google Scholar CrossRef Search ADS Stokes S. L. , Yeh M . ( 1988 ), “Searching for Causes of Interviewer Effects in Telephone Surveys,” Telephone Survey Methodology , ed. Groves R. M. , pp. 357 – 373, New York : John Wiley and Sons . Sturgis P. , Campanelli P. ( 1998 ), “ The Scope for Reducing Refusals in Household Surveys: An Invesigation Based on Transcripts of Tape-Recorded Doorstep Interactions ,” Journal of the Market Research Society , 40 , 121 – 139 . Suchman L. , Jordan B. ( 1990 ), “ Interactional Troubles in Face-to-Face Survey Interviews ,” Journal of the American Statistical Association , 85 , 232 – 241 . Google Scholar CrossRef Search ADS Trappmann M. , Beste J. , Bethmann A. , Müller G. ( 2013 ), “ The PASS Panel Survey After Six Waves ,” Journal of Labor Market Research , 46 , 275 – 281 . Google Scholar CrossRef Search ADS Tucker C. ( 1983 ), “ Interviewer Effects in Telephone Surveys ,” Public Opinion Quarterly , 47 , 84 – 95 . Google Scholar CrossRef Search ADS Vassallo R. , Durrant G. , Smith P. ( 2016 ), “ Separating Interviewer and Area Effects by Using a Cross-Classified Multilevel Logistic Model: Simulation Findings and Implications for Survey Designs ,” Journal of the Royal Statistical Society, Series A . doi: 10.1111/rssa.12206. West B. T. , Blom A. G. ( 2017 ), “ Explaining Interviewer Effects: A Research Synthesis ,” Journal of Survey Statistics and Methodology , 5 , 175 – 211 . West B. T. , Conrad F. G. , Kreuter F. , Mittereder F. ( 2017 ), “ Can Conversational Interviewing Improve Survey Response Quality Without Increasing Interviewer Effects? ,” Journal of the Royal Statistical Society, Series A . doi: 10.1111/rssa.12255. West B. T. , Elliott M. R. ( 2014 ), “ Frequentist and Bayesian Approaches for Comparing Interviewer Variance Components in Two Groups of Survey Interviewers ,” Survey Methodology , 40 , 163 – 188 . West B. T. , Elliott M. R. ( 2015 ), “New Methodologies for the Study and Decomposition of Interviewer Effects in Surveys,” Paper Presented at the Annual Meeting of the Statistical Society of Canada (SSC), Halifax, Nova Scotia, 6/17/15. West B. T. , Kreuter F. , Jaenichen U. ( 2013 ). “ Interviewer’ Effects in Face-to-face Surveys: A Function of Sampling, Measurement Error or Nonresponse? ,” Journal of Official Statistics , 29 , 277 – 297 . Google Scholar CrossRef Search ADS West B. T. , Olson K. ( 2010 ), “ How Much of Interviewer Variance is Really Nonresponse Error Variance? ,” Public Opinion Quarterly , 74 , 1004 – 1026 . Google Scholar CrossRef Search ADS West B. T. , Wagner J. , Gu H. , Hubbard F. ( 2015 ), “ The Utility of Alternative Commercial Data Sources for Survey Operations and Estimation: Evidence from the National Survey of Family Growth ,” Journal of Survey Statistics and Methodology , 3 , 240 – 264 . Google Scholar CrossRef Search ADS Zhang D. , Lin X. ( 2008 ), “Variance Component Testing in Generalized Linear Mixed Models for Longitudinal/Clustered Data and Other Related Topics,” Random Effect and Latent Variable Model Selection , ed. Dunson D. B. , Lecture Notes in Statistics, 192 , New York, NY : Springer . © The Author 2017. Published by Oxford University Press on behalf of the American Association for Public Opinion Research. All rights reserved. For Permissions, please email: journals.permissions@oup.com http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Journal of Survey Statistics and Methodology Oxford University Press

Nonresponse and Measurement Error Variance among Interviewers in Standardized and Conversational Interviewing

Loading next page...
 
/lp/ou_press/nonresponse-and-measurement-error-variance-among-interviewers-in-Dju8ho2y2b
Publisher
Oxford University Press
Copyright
© The Author 2017. Published by Oxford University Press on behalf of the American Association for Public Opinion Research. All rights reserved. For Permissions, please email: journals.permissions@oup.com
ISSN
2325-0984
eISSN
2325-0992
D.O.I.
10.1093/jssam/smx029
Publisher site
See Article on Publisher Site

Abstract

Abstract Recent methodological studies have attempted to decompose the interviewer variance introduced in interviewer-administered surveys into its potential sources, using the Total Survey Error framework. These studies have informed the literature on interviewer effects by acknowledging interviewers’ dual roles as recruiters and data collectors, thus examining the relative contributions of nonresponse error variance and measurement error variance among interviewers to total interviewer variance. However, this breakdown may depend on the interviewing technique: some techniques emphasize behaviors designed to reduce variation in the answers collected by interviewers more so than other techniques. The question of whether the contributions of these error sources to total interviewer variance change for different interviewing techniques remains unanswered. Addressing this gap in knowledge has important implications for interviewing practice because the technique used could alter the relative contributions of variance in these error sources to total interviewer variance. This article presents results from an experimental study mounted in Germany that was designed to answer this question about two specific interviewing techniques. A national sample of employed individuals was first selected from a database of official administrative records, then randomly assigned to interviewers who themselves were randomized to conduct either conversational interviewing (CI) or standardized interviewing (SI), and finally measured face-to-face on a variety of cognitively challenging survey questions with official values also available for verifying the accuracy of responses. We find that although nonresponse error variance does exist among interviewers for selected measures (especially respondent age in the CI group), measurement error variance tends to be the more important source of total interviewer variance, regardless of whether interviewers are using CI or SI. 1. INTRODUCTION Interviewer variance continues to be a vexing problem for survey researchers. Survey estimates suffer from reduced quality when interviewer-specific estimates vary despite the assignment of cases with similar features to different interviewers. This variability reduces effective sample sizes in a manner similar to cluster sampling (Elliott and West 2015; Schnell and Kreuter 2005; Groves 2004; O’Muircheartaigh and Campanelli 1998), increasing the variance of estimates given fixed costs of data collection. More specifically, one can define a multiplicative “interviewer effect” on the variance of an estimator of the mean for a particular survey item as 1+(m¯−1)ρint⁡. In this well-known expression, m¯ corresponds to the average number of interviews completed across all interviewers, and ρint⁡ is the intra-interviewer correlation (IIC) of the responses to the survey item. The IIC is defined as the ratio of the between-interviewer variance in the responses to the sum of the between- and within-interviewer variance in the responses (i.e., the total variance in the responses). Thus, given an estimated IIC of 0.02 and an average of thirty interviews completed per interviewer, one would expect the variance of the estimated mean to be inflated by 58%. The development of strategies for minimizing interviewer variance requires the execution of carefully designed studies that isolate the effects of interviewers on survey measures of interest and identify the reasons for these effects. The implementation of these types of studies, however, is quite difficult in practice. Random assignment of sampled cases to interviewers (i.e., interpenetrated sampling), which is required for estimation of interviewer effects on a given survey outcome, is often not feasible for cost reasons. This introduces the possibility of variance among interviewers in the features of sampled cases prior to any recruiting or measurement efforts. Despite the extra effort and costs, researchers have been able to randomly assign cases to interviewers (or isolate interviewer effects analytically) in several studies, and the effects of interviewers on both response rates and survey measurement are now well documented (Schaeffer, Dykema, Maynard 2010; West and Blom, 2017). Several studies have suggested that interviewer variance in respondent reports may actually be arising from nonresponse error variance among interviewers, or the recruitment of individuals with different features by different interviewers (Stock and Hochstim 1951; Moser and Stuart 1953; Tucker 1983; Stokes and Yeh 1988). More recently, methodological studies have attempted to take advantage of the information in sampling frames about both respondents and nonrespondents, and decompose the variability among interviewers into three sources: 1) sampling error variance, introduced by a lack of interpenetrated assignment; 2) nonresponse error variance, or the variance among interviewers in the true values of the cases that they successfully recruit; and 3) measurement error variance, or the additional variance among interviewers introduced by the deviations of the responses that they collect from their true values (West and Olson 2010; West, Kreuter, and Jaenichen 2013). Although the literature has paid far more attention to measurement than nonresponse as the origin of interviewer variance, these recent decomposition studies have indicated that interviewer variance in respondent reports can in fact arise from nonresponse error variance. This decomposition of interviewer variance may also depend on the interviewing technique used. Standardized interviewing (SI) emphasizes behaviors designed to reduce variation between interviewers in the responses they collect (e.g., Fowler and Mangione 1990), while conversational interviewing (CI) encourages interviewers to say what is necessary to ensure that respondents understand survey questions as they are intended (e.g., Schober and Conrad 1997). No studies have performed comparative decompositions of the interviewer variance introduced by these two techniques. In theory, standardized interviewers provide all respondents with the exact same question stimuli and respond to respondents’ questions, concerns, and confusion with neutral probes (e.g., “let me repeat the question”). This technique standardizes question administration, and while interviewers are known to deviate from its strict principles in real-world settings (Peneff 1988; Houtkoop-Steenstra 1995; Schaeffer etal. 2010; Haan, Ongena, and Huiskes 2013; Ackermann-Piek and Massing 2014; see also Suchman and Jordan 1990), it theoretically removes the interviewer as a potential source of variable error during the measurement process, reducing the probability of interviewer effects due to the way questions are asked (Henson, Cannell, and Lawson 1976; Groves and Magilavy 1986; Fowler and Mangione 1990; Mangione, Fowler, and Louis 1992). Any measurement error variance among interviewers in SI would therefore be expected to arise from non-verbal behaviors, behaviors that deviate from SI protocol (e.g., excessive probing), or the effects of demographic characteristics of the interviewer during measurement (West and Blom 2017). Conversational interviewers, on the other hand, initially read questions exactly as worded and are then granted the flexibility to say whatever is necessary to ensure respondent comprehension if respondents misunderstand questions or express confusion or uncertainty about the questions. While CI can increase administration time compared to SI1, it has repeatedly been shown to increase the accuracy of responses to factual questions relative to SI (Schober and Conrad 1997; Conrad and Schober 2000; Schober, Conrad, and Fricker 2004; Schober, Conrad, Dijkstra, and Ongena 2012; Hubbard, Antoun, and Conrad 2012). In theory, one would therefore expect that CI will not introduce interviewer variance in factual questions either, provided that the interviewers are successful in consistently achieving accurate measurements. A recent study (West, Conrad, Kreuter and Mittereder, 2017) found that CI rarely increased overall interviewer variance relative to SI for a variety of survey items related to employment history and housing conditions and that these increases did not offset gains in the overall quality of estimates due to the increased response accuracy engendered by CI. Nevertheless, any interviewer variance introduced by either technique could still compromise the quality of selected survey estimates, making knowledge of whether it results from errors during recruitment (nonresponse) or interviewing (measurement) key to improving survey quality. No studies have addressed this gap in knowledge. A related interviewing approach that has been found to improve the accuracy of respondent recall in multiple studies is event history calendar (EHC) interviewing (e.g., Belli, Lee, Stafford, and Chou 2004; Sayles, Belli, and Serrano 2010). When survey respondents are asked to recall potentially complex autobiographical measures, an interviewer employing the EHC approach uses a flexible, conversational style to encourage the respondent to employ narrative use of retrieval cues available in autobiographical memory. Several studies have compared data collected using the EHC approach to data collected using a more standardized interviewing approach, and these studies have repeatedly demonstrated the benefits of the EHC approach for response accuracy (Belli 1998; Belli, Shay, and Stafford 2001; Belli etal. 2004; Sayles etal. 2010). Only one study to date has specifically considered the interviewer variance introduced by this more flexible approach relative to a more standardized approach (Sayles etal. 2010), finding evidence of modest increases in interviewer variance due to the use of the EHC approach. Much like the CI literature, no studies to date have considered the sources of any interviewer variance introduced by the EHC approach. Although techniques like CI and the EHC approach are designed to reduce measurement error, it is possible that the additional flexibility granted to interviewers using these approaches could result in measurement error variance accounting for a larger portion of the total interviewer variance introduced by CI, relative to SI. This may be due to uneven implementation of the techniques across interviewers; some interviewers may go off on tangents, and others may present respondents with incorrect definitions of key concepts or make erroneous statements when attempting to address respondent confusion. Hypotheses related to the expected contributions of nonresponse error variance for these two techniques, on the other hand, are harder to formulate given the lack of prior work in this area. One possibility is that some of the conversational interviewers who are trained to say what is required to ensure that respondents understand questions as intended during measurement may also improvise (tailor) more when securing respondent cooperation, thus introducing interviewer variance in nonresponse rates (e.g., Groves, Cialdini, and Couper 1992; Morton-Williams 1993; Campanelli, Sturgis, and Purdon 1997; Sturgis and Campanelli 1998; Snijkers, Hox, and de Leeuw 1999; Groves and McGonagle 2001). For example, some conversational interviewers may emphasize certain features of the survey design (e.g., incentives, timing, etc.) more so than others when tailoring their recruitment efforts to respondent concerns. While this differential emphasis may have the potential to mitigate the negative leverage of specific design features on the decision to participate for certain respondents, following the ideas of leverage-saliency theory (Groves, Singer, and Corning 2000), it could lead to nonresponse error variance among interviewers if the features emphasized by some conversational interviewers tend to be strongly linked to key survey measures. For instance, a conversational interviewer who tends to emphasize the incentives when convincing people to participate may end up only recruiting persons with lower socio-economic status who are not as interested in the topic, and these individuals may respond on key measures quite differently from individuals recruited by another interviewer emphasizing different design features. On the other hand, one could also argue that higher rates of tailoring among conversational interviewers could lead to more systematically successful recruitment, and interviewers using a more standardized recruitment approach (which may work well for some sampled cases but not others) may vary more in terms of achieved response rates. Regardless of the (unknown) effects of these different measurement techniques on recruitment practices, whether variability in response rates across interviewers leads to increased interviewer variance in nonresponse errors is also currently unknown (Groves 2006; Groves and Peytcheva 2008). The concept of “liking” could also lead to nonresponse error variance, regardless of whether CI or SI is used. Interviewers may tend to recruit respondents with similar socio-demographic features. For instance, younger interviewers may tend to recruit younger respondents (West and Blom 2017). This variance among interviewers in the types of respondents recruited could introduce nonresponse error variance if these socio-demographic features are correlated with the survey measures of interest. It may be the case that during recruitment, more conversational interviewers tend to draw more attention to their socio-demographic features while tailoring and maintaining engagement, but this is presently unknown. Initial work is needed to understand the extent to which nonresponse error variance exists for either interviewing technique and whether observable features of the interviewers play a role in introducing this type of variance. To date, these alternative hypotheses have not been tested for the CI and SI techniques. If CI and SI vary in terms of the relative contributions of nonresponse error variance and measurement error variance to total interviewer variance, then deployment of one technique or the other will likely require more technique-specific training than is now common in order to reduce both types of error. Researchers could eventually design follow-up studies to test the effects of this type of training on the measurement and nonresponse error variance introduced by a particular technique, but an essential first step is the quantification and comparison of the contributions of nonresponse error variance and measurement error variance to the total interviewer variance introduced by the CI and SI techniques. We base these comparisons on data collected from a large national sample in Germany, where interviewers were randomly assigned to use either CI or SI. The randomly sampled persons assigned to each interviewer were measured on a variety of survey items, some of which had validation information available on the sampling frame. We first identify items with significant interviewer variance overall and then decompose this variance into the sources mentioned above. 2. METHODS 2.1 Sample Size Considerations Given our specific objectives, we wanted to design a study that would have enough power to detect differences in the variance components used to compute IICs between standardized and conversational interviewers as being significant. We therefore began by performing a customized Monte Carlo simulation study to assess the power of alternative designs to detect a range of differences in variance components (see the SAS code in the online supplementary material). A review of the literature on interviewer effects (West and Olson 2010; Schnell and Kreuter 2005; O’Muircheartaigh and Campanelli 1998) suggests that most IICs will range from 0.01 to 0.12 in face-to-face surveys, with many falling below 0.02. Furthermore, in our recent work analyzing data on survey items from a face-to-face study in Germany (West, Kreuter, and Jaenichen 2013) with subject matter similar to that considered in the present study (employment histories), we found that these IICs ranged from below 0.01 to approximately 0.09. An IIC of 0.09 would quadruple the variance of an estimate, reducing the effective sample size by 75%. We used these earlier results to perform power calculations for this study, ensuring that we would have enough power to detect differences of this magnitude for both continuous and binary survey measures of employment history.2 In the simulation studies, we found that having thirty interviewers measuring a continuous item for each of the two techniques (sixty interviewers total) and thirty respondents for each interviewer (or 1,800 respondents in total) would yield approximately 80% power to detect a 6.6-fold difference in the between-interviewer variance components used to compute the IIC as significant, based on a likelihood ratio test (West and Elliott 2014), with a 5% level of significance. A difference of this size in the variance components falls within the aforementioned range of IICs (0.01 to 0.09) that we have seen in related studies with similar subject matter. Furthermore, we found that having thirty interviewers in each of the two experimental groups measuring a binary item and thirty respondents for each interviewer would yield approximately 82% power to detect a similar difference in the between-interviewer variance components, again using a likelihood ratio test with a 5% level of significance. We therefore based our data collection protocol on meeting these targets. 2.2 Sampling and Data Collection Next, we designed and conducted an in-person data collection in fifteen large areas in Germany. In each area, we selected a simple random sample of 480 currently employed adults who had experienced at least one period of unemployment (or “unemployment spell”) from the Integrated Employment Biographies (IEB) database, which contains official government information on employment histories.3 The overall sample size was thus n = 15 × 480 = 7,200, allowing for a 25% response rate given the aforementioned power analysis. Four professional interviewers were assigned to work in each of the fifteen areas (sixty interviewers total), and each was assigned 120 of the randomly sampled cases in total (i.e., interpenetrated sample assignment, conditional on the area). We note that the design of this study differs from most other studies of interviewer effects where interviewers work in multiple areas, in that interviewers were nested within the fifteen areas in Germany. More recent work has suggested that designs allowing interviewers to work in different areas introduce important sample size and sample assignment considerations (Vassallo, Durrant, and Smith 2016; e.g., how many areas should be worked by an interviewer for interpenetration to hold?), but this was not a feature of our design. We therefore conducted the simulation study described above, and the SAS code in the online supplementary material can be readily applied for similar calculations. Additional details related to the sampling used in this study can be found in West, Conrad, Kreuter, Mittereder (2017). Two randomly selected interviewers in each area were thoroughly trained in the use of CI, and the other two were similarly trained in the use of SI. Both groups of interviewers were trained to administer a 30-minute instrument in a face-to-face setting. This instrument included several questions that met at least one of the following four criteria: 1) the question had been tested and used in previous national surveys in Germany; 2) answering the question was judged to require complex cognitive processes, mainly related to employment history; 3) the question had historically required a large amount of interviewer-respondent interaction, according to German researchers; and 4) official administrative values were available in the IEB database for validating responses. We also included attitudinal items without available validation data to study the effects of the interviewing techniques on these responses. The full questionnaire, translated into English, is available at http://doku.iab.de/fragebogen/CIIV_Questionnaire_English_08052015.pdf. Data collection continued from March to October in 2014. Prior to the onset of data collection, an advance letter was mailed to the sampled persons indicating the general topic of the survey (“Working and Living in Germany”), stressing the importance of the study for describing this population, and promising the sampled persons a twenty-euro token of appreciation for participating. The advance letter had the exact same content for both CI and SI. We were unable to record the recruitment conversations given the resources available for this study and, therefore, could not determine whether interviewers gave equal weight to each of these study features during recruitment; different interviewers giving different weight to different features could potentially introduce nonresponse error variance. In total, n = 1,850 interviews were completed by the sixty interviewers (AAPOR RR1 = 25.7%), resulting in an average of roughly thirty-one interviews per interviewer (minimum = 5, maximum = 59). Our data collection therefore hit the aforementioned target of 1,800 respondents. The aforementioned West etal. (2017) study presented initial analyses of these data. This study sought to compare the overall IICs and response distributions for a total of fifty-five variables generated by this data collection, including the respondent reports on each survey item and selected survey paradata (e.g., interview duration or interviewer observations of response quality), between the CI and SI groups of interviewers and determine whether there were any consistent differences in the overall IICs (based on respondent measures). This study found that CI generally produced higher IICs (see figure 1), but substantial differences tended to be rare, with significantly higher IICs emerging for CI for only five of the items. For the one item of these five with administrative data available from the IEB database, the increase in the IIC was not found to produce an estimate with a higher estimated mean squared error (MSE) relative to SI. CI was also found to shift response distributions significantly and in the direction of higher data quality for fourteen of the fifty-five items, reflecting the general benefits of CI for response quality that have been reported previously in the literature. Importantly, this initial study primarily considered respondent reports only, and the present study sought to use the official IEB data on the sampling frame to study the decompositions of any observed interviewer variance in respondent reports for each interviewing technique. Figure 1. View largeDownload slide Scatter Plot of the Estimated IICs in Respondent Reports for the SI and CI Groups, for All 55 Variables Analyzed in West etal. (2017). Each point represents a unique variable, and the variable with the highest IICs in each group (upper-right corner of figure 1) was interview duration. Figure 1. View largeDownload slide Scatter Plot of the Estimated IICs in Respondent Reports for the SI and CI Groups, for All 55 Variables Analyzed in West etal. (2017). Each point represents a unique variable, and the variable with the highest IICs in each group (upper-right corner of figure 1) was interview duration. Especially relevant to the present study were eleven survey items with responses that could be validated using administrative information on the IEB sampling frame. These eleven items are listed below, in addition to our expectations regarding the primary sources of interviewer variance in respondent reports on each item based on prior research. We note that we cannot generate expectations about the primary sources of interviewer variance for each technique given the lack of prior work in this area, resulting in the more general expectations below: Respondent birth year (expectation: nonresponse error variance among interviewers, given the simplicity of the measure; West and Olson 2010; West etal. 2013) Any unemployment since January 1, 2013 (expectation: a mixture of nonresponse error variance and measurement error variance among interviewers; West etal. 2013) Months of employment since January 1, 2013 (expectation: measurement error variance among interviewers, given possible recall difficulty and differential assistance of interviewers with recall tasks; Bruckmeier, Müller, and Riphahn 2015; Sayles etal. 2010) Months of unemployment since January 1, 2013 (expectation: measurement error variance; Bruckmeier etal. 2015; Sayles etal. 2010) Most recent gross monthly income (zero if unemployed since January 1, 2013, and natural-log transformed) (expectation: a mixture of nonresponse error variance and measurement error variance, given the complexity of the measure; Bailar, Bailey, and Stevens 1977) Gross annual income in 2013 (zero if unemployed since January 1, 2013, and natural-log transformed) (expectation: a larger contribution of measurement error variance than for most recent monthly income, given all possible sources of income over the course of a year; Bailar etal. 1977) Number of marginal (“mini”) jobs, defined by short-term employment on a fixed contract that does not exceed two months or more than fifty days in a calendar year, in the past five years (expectation: measurement error variance, given the complex definition of a “mini” job and the length of the recall period; Sayles etal. 2010) Months employed in the last twenty years (expectation: measurement error variance, given the length of the recall period and opportunities for interviewers to assist; Sayles etal. 2010) Longest uninterrupted period of employment in the last twenty years (in months) (expectation: same as above) Number of times registered as unemployed in the last twenty years (expectation: same as above) Months unemployed in the last twenty years (expectation: same as above) In the present study, we first identified which of these variables presented evidence of significant interviewer variance in respondent reports. We then analyzed the IEB data (referred to as “record values”) to understand the primary sources of this variance and whether they were consistent with the expectations outlined above. 2.3 Interviewer Evaluation To enable more in-depth analyses of the implementation of SI and CI in the field, the interviewers asked persons agreeing to participate in the survey for permission to record the entire interview. In total, 69.9% of all respondents agreed to have the entire interview recorded. All of the recordings that we analyzed contained the respondent’s affirmative consent to be recorded and did not include any identifying information. The field supervisors provided the interviewers with bi-weekly feedback on their performance based on these recordings and reported that the performance of selected interviewers improved (i.e., the interviewers became better at using their assigned technique) after receiving this feedback. Overall, when analyzing a subsample of these recordings in detail for interviewers in each of the two groups, we found that the CI and SI techniques were implemented correctly and consistently (Mittereder, Durow, West, Kreuter, and Conrad forthcoming). 2.4 Analytic Approach For each of the eleven questions with available validation data on the sampling frame and for each technique, we applied the multi-step methodology used in previous decomposition studies by West and Olson (2010) and West etal. (2013). In each step, we fitted a multilevel model (e.g., multilevel logistic models for questions with two possible responses) appropriate for the type of question that included fixed area effects (treating the areas as fixed characteristics of the interviewers, given that hypothetical replications of this study would use the same fifteen areas) and random interviewer effects (capturing effects unique to the interviewers and not the areas). The variances of the random interviewer effects were allowed to vary depending on the interviewing technique (CI or SI). This makes the analytic approach used for this study unique relative to previous decomposition studies, in that we can consider technique-specific decompositions of the interviewer variance. The four modeling steps were as follows: Estimate the interviewer variance in the means of record values for the randomly assigned sample cases (i.e., sampling error variance, expected to be negligible). Estimate the interviewer variance in the means of record values for respondents, considering unit nonresponse only (i.e., nonresponse error variance, assuming no sampling error variance). Estimate the interviewer variance in the means of record values for respondents, considering both unit and item nonresponse. Estimate the total interviewer variance in the means of reported values for respondents (i.e., a combination of measurement error variance and nonresponse error variance). Given successful interpenetrated assignment, we would expect the IIC for each technique in step 1) to be zero. The difference in the estimates between steps 4) and 3) for a given technique represents an estimate of the measurement error variance introduced by interviewers, assuming a negligible covariance between nonresponse error and measurement error among interviewers (West and Olson 2010). Given the interpenetrated design used in this study (conditional on the areas), this four-step methodology enables us to compare the variance components among interviewers arising from sampling, unit nonresponse, item nonresponse, and measurement for each interviewing technique. This methodology also allows us to test whether the variance components at each step (for each technique) are greater than zero using appropriate likelihood ratio tests (Zhang and Lin 2008). The fitted models also facilitated comparisons of the interviewer variance components arising at a given step between the two techniques. For all binary dependent variables yij, the models used had the form defined in (1) below: ln⁡(P(yij=1)P(yij=0))=β0+β1I[CIi=1]+∑p=215βpI[AREAi=p]+u1iI[CIi=1]+u2iI[SIi=1]u1i∼N(0,τCI2), u2i∼N(0,τSI2) (1) In (1), i denotes an interviewer and j a respondent, and the model includes a fixed intercept, a fixed effect of the CI technique, fourteen fixed area effects, and random interviewer effects specific to a given technique (u1i for CI and u2i for SI). We remind readers that the interviewers were, by design, not allowed to work in multiple randomly sampled areas, so we did not employ a cross-classified multilevel model that would be appropriate for these types of crossed designs (Vassallo etal. 2016). We also note that the variance of the random interviewer effects at each step is allowed to vary depending on the technique. The two interviewer components of variance for a given step were formally compared using the likelihood ratio testing approach outlined by West and Elliott (2014), enabling us to determine whether one technique resulted in more interviewer variability at a given step. Models of the form in (1) were estimated using adaptive Gauss-Hermite quadrature, as implemented in Stata’s melogit command (Version 14.1). Intra-interviewer correlations (IICs) were computed for each technique based on the underlying logistic distribution; for example, ρint⁡,CI=τCI2τCI2+π2/3. For all continuous measures, models of the form in (2) were fitted at each step using restricted maximum likelihood estifmation, as implemented in the mixed command in Stata: yij=β0+β1I[CIi=1]+∑p=215βpI[AREAi=p]+u1iI[CIi=1]+u2iI[SIi=1]+ɛiju1i∼N(0,τCI2), u2i∼N(0,τSI2), ɛij∼N(0,σCI2) if CIi=1, ɛij∼N(0,σSI2) if SIi=1 (2) We note that the model in (2) also allows for a unique residual variance for each technique as well, enabling the estimation of technique-specific IICs (e.g., ρint⁡,CI=τCI2τCI2+σCI2). We provide examples of the Stata code used to fit the models and the resulting output in the online supplementary material. 3 RESULTS For three of the eleven questions, we found evidence of at least marginally significant interviewer variance (P < 0.10) in respondent reports for one of the techniques. Less stringent criteria for significance has been suggested previously (e.g., P < 0.25 in Kish 1962), but we sought to identify items subject to substantial interviewer variance, given our objective of producing a reliable decomposition of that variance. The existence of this variance for three of these items is notable; if null hypotheses that interviewer variance components were equal to zero held for each of the eleven questions, we would have expected to see significant interviewer variance for only one of the eleven questions by random chance alone when using likelihood ratio tests with a 5% significance level. The three survey questions included respondent birth year (CI: IIC = 0.025, P = 0.041), 2013 gross annual income (SI: IIC = 0.021, P = 0.072), and longest uninterrupted period of employment in the past twenty years (CI: IIC = 0.066, P < 0.001). We therefore focus on decompositions of the interviewer variance for these three survey questions.4 First considering respondent birth year, table 1 shows the relative decompositions of the interviewer variance for each technique, in addition to tests of significance comparing the variance components at each step. Each row of table 1 presents the estimated interviewer variance components for SI (column 2) and CI (column 3) for the values analyzed at each specific step of the decomposition process (noted in Column 1). So, for example, we see in row 2 of table 1 that when analyzing the record values of respondents and considering unit nonresponse only, there is no evidence of interviewer variance in the SI group, but there is significant evidence of interviewer variance in the CI group (estimated interviewer variance component = 3.130, likelihood ratio test P < 0.05). Table 1. Decomposition of the Total Interviewer Variance in Respondent Reports of Birth Years for SI and CI (see figure 2) SI CI Test of difference in variance components Estimated interviewer variance component, record values of full sample (n = 7,199) < 0.01 < 0.01 NS Estimated interviewer variance component, record values of respondents, unit nonresponse only (n = 1,850) < 0.01 3.130** p = 0.143 Estimated interviewer variance component, record values of respondents, unit and item nonresponse (n = 1,849) < 0.01 3.122** p = 0.141 Estimated interviewer variance component, reported values of respondents (n = 1,849) < 0.01 3.218** p = 0.137 Estimated intra-interviewer correlation (IIC) in respondent reports < 0.001 0.025 Main source of overall interviewer variance in respondent reports None Nonresponse error variance SI CI Test of difference in variance components Estimated interviewer variance component, record values of full sample (n = 7,199) < 0.01 < 0.01 NS Estimated interviewer variance component, record values of respondents, unit nonresponse only (n = 1,850) < 0.01 3.130** p = 0.143 Estimated interviewer variance component, record values of respondents, unit and item nonresponse (n = 1,849) < 0.01 3.122** p = 0.141 Estimated interviewer variance component, reported values of respondents (n = 1,849) < 0.01 3.218** p = 0.137 Estimated intra-interviewer correlation (IIC) in respondent reports < 0.001 0.025 Main source of overall interviewer variance in respondent reports None Nonresponse error variance Note.—**p < 0.05, based on a likelihood ratio test; NS = not significant. Table 1. Decomposition of the Total Interviewer Variance in Respondent Reports of Birth Years for SI and CI (see figure 2) SI CI Test of difference in variance components Estimated interviewer variance component, record values of full sample (n = 7,199) < 0.01 < 0.01 NS Estimated interviewer variance component, record values of respondents, unit nonresponse only (n = 1,850) < 0.01 3.130** p = 0.143 Estimated interviewer variance component, record values of respondents, unit and item nonresponse (n = 1,849) < 0.01 3.122** p = 0.141 Estimated interviewer variance component, reported values of respondents (n = 1,849) < 0.01 3.218** p = 0.137 Estimated intra-interviewer correlation (IIC) in respondent reports < 0.001 0.025 Main source of overall interviewer variance in respondent reports None Nonresponse error variance SI CI Test of difference in variance components Estimated interviewer variance component, record values of full sample (n = 7,199) < 0.01 < 0.01 NS Estimated interviewer variance component, record values of respondents, unit nonresponse only (n = 1,850) < 0.01 3.130** p = 0.143 Estimated interviewer variance component, record values of respondents, unit and item nonresponse (n = 1,849) < 0.01 3.122** p = 0.141 Estimated interviewer variance component, reported values of respondents (n = 1,849) < 0.01 3.218** p = 0.137 Estimated intra-interviewer correlation (IIC) in respondent reports < 0.001 0.025 Main source of overall interviewer variance in respondent reports None Nonresponse error variance Note.—**p < 0.05, based on a likelihood ratio test; NS = not significant. Table 1 therefore presents no evidence of variance among the interviewers in either group in terms of mean birth year based on the full sample assignments, as expected based on the interpenetrated design. However, in the CI group, there is evidence of a substantial increase in the interviewer variance in mean birth years (based on the administrative data) when considering respondents only, suggesting significant nonresponse error variance (P = 0.045). This result hardly changes when including the one additional respondent that refused to provide their birth year in the survey. This variance then persists in the CI group when considering interviewer variance in the respondent reports, where measurement error variance would not be expected for a simple factual question like birth year. These results clearly show that the overall interviewer variance in mean birth years in the CI group was arising from nonresponse error variance, consistent with previous research (West and Olson 2010, West etal. 2013). This decomposition is illustrated in figure 2. Figure 2. View largeDownload slide Box Plots of Predicted Random Interviewer Effects (EBLUPs) at Each Stage of the Data Collection (Where an Interviewer with a Mean Equal to the Overall Mean Would Have an EBLUP of 0), Illustrating the Increased Interviewer Variance in Respondent Birth Years that Was Introduced in the CI Group (Top Row) at the Recruitment Stage (Based on Unit Nonresponse). Figure 2. View largeDownload slide Box Plots of Predicted Random Interviewer Effects (EBLUPs) at Each Stage of the Data Collection (Where an Interviewer with a Mean Equal to the Overall Mean Would Have an EBLUP of 0), Illustrating the Increased Interviewer Variance in Respondent Birth Years that Was Introduced in the CI Group (Top Row) at the Recruitment Stage (Based on Unit Nonresponse). Next, we consider the decomposition of interviewer variance for 2013 gross annual income (log-transformed) in table 2. Table 2 demonstrates that sampling error variance and nonresponse error variance (based on either unit nonresponse only or both unit and item nonresponse) were not issues in either of the groups, and that measurement error variance was the primary source of the marginally significant interviewer variance in the SI group. This finding, illustrated in figure 3, is largely consistent with the results presented by Bailar etal. (1977). Table 2. Decomposition of the Total Interviewer Variance in Log-Transformed Reports of Gross Annual income in 2013 for SI and CI (see figure 3) SI CI Test of difference in variance components Estimated interviewer variance component, record values of full sample (n = 7,154) < 0.01 < 0.01 NS Estimated interviewer variance component, record values of respondents, unit nonresponse only (n = 1,846) < 0.01 < 0.01 NS Estimated interviewer variance component, Record values of respondents, unit and item nonresponse (n = 1,479) < 0.01 < 0.01 NS Estimated interviewer variance component, reported values of respondents (n = 1,479) 0.152* < 0.01 NS Estimated intra-interviewer correlation (IIC) in respondent reports 0.021 < 0.001 Main source of overall interviewer variance in respondent reports Measurement error variance None SI CI Test of difference in variance components Estimated interviewer variance component, record values of full sample (n = 7,154) < 0.01 < 0.01 NS Estimated interviewer variance component, record values of respondents, unit nonresponse only (n = 1,846) < 0.01 < 0.01 NS Estimated interviewer variance component, Record values of respondents, unit and item nonresponse (n = 1,479) < 0.01 < 0.01 NS Estimated interviewer variance component, reported values of respondents (n = 1,479) 0.152* < 0.01 NS Estimated intra-interviewer correlation (IIC) in respondent reports 0.021 < 0.001 Main source of overall interviewer variance in respondent reports Measurement error variance None Note.—*p < 0.10, based on a likelihood ratio test; NS = not significant. Table 2. Decomposition of the Total Interviewer Variance in Log-Transformed Reports of Gross Annual income in 2013 for SI and CI (see figure 3) SI CI Test of difference in variance components Estimated interviewer variance component, record values of full sample (n = 7,154) < 0.01 < 0.01 NS Estimated interviewer variance component, record values of respondents, unit nonresponse only (n = 1,846) < 0.01 < 0.01 NS Estimated interviewer variance component, Record values of respondents, unit and item nonresponse (n = 1,479) < 0.01 < 0.01 NS Estimated interviewer variance component, reported values of respondents (n = 1,479) 0.152* < 0.01 NS Estimated intra-interviewer correlation (IIC) in respondent reports 0.021 < 0.001 Main source of overall interviewer variance in respondent reports Measurement error variance None SI CI Test of difference in variance components Estimated interviewer variance component, record values of full sample (n = 7,154) < 0.01 < 0.01 NS Estimated interviewer variance component, record values of respondents, unit nonresponse only (n = 1,846) < 0.01 < 0.01 NS Estimated interviewer variance component, Record values of respondents, unit and item nonresponse (n = 1,479) < 0.01 < 0.01 NS Estimated interviewer variance component, reported values of respondents (n = 1,479) 0.152* < 0.01 NS Estimated intra-interviewer correlation (IIC) in respondent reports 0.021 < 0.001 Main source of overall interviewer variance in respondent reports Measurement error variance None Note.—*p < 0.10, based on a likelihood ratio test; NS = not significant. Figure 3. View largeDownload slide Box Plots of Predicted Random Interviewer Effects (EBLUPs) at Each Stage of the Data Collection (Where an Interviewer with a Mean Equal to the Overall Mean Would Have an EBLUP of 0), Illustrating the Increased Interviewer Variance in 2013 Gross Annual Income Values (Log-Transformed) that Was Introduced in the SI Group (Bottom Row) at the Measurement Stage. Figure 3. View largeDownload slide Box Plots of Predicted Random Interviewer Effects (EBLUPs) at Each Stage of the Data Collection (Where an Interviewer with a Mean Equal to the Overall Mean Would Have an EBLUP of 0), Illustrating the Increased Interviewer Variance in 2013 Gross Annual Income Values (Log-Transformed) that Was Introduced in the SI Group (Bottom Row) at the Measurement Stage. Finally, table 3 provides the same decomposition results for the longest uninterrupted period of employment in the past twenty years. Table 3 presents evidence of nonresponse error variance in the SI group, given evidence of successful interpenetration for each technique. The estimated interviewer variance component increased substantially when considering record values for respondents only, ultimately increasing by more than 1200% (P = 0.059) when also including the seventeen respondents who did not answer this question on the survey (with the added item nonresponse increasing the interviewer variance slightly). These increases were offset by the eventual respondent reports, which tended to be closer to the overall mean. (We elaborate on this finding in the discussion section below.) In the CI group, on the other hand, there was negligible evidence of interviewer variance based on the record values of the recruited respondents. However, the interviewer variance in the CI group increased substantially at the measurement stage, to the point where the estimated IIC was 0.066 and the total interviewer variance was significant at the 1% level. Further, the difference in the total interviewer variance between the two techniques was also significant at the 1% level. This finding is largely consistent with the results of Sayles etal. (2010), who suggested that EHC approaches may increase interviewer variance in a modest fashion for recall items. Figure 4 illustrates these patterns in interviewer variance for each technique. Table 3. Decomposition of the Total Interviewer Variance in Reports of Longest Uninterrupted Period of Employment in the Past Twenty Years for the SI and CI Groups (see figure 4) SI CI Test of difference in variance components Estimated interviewer variance component, record values of full sample (n = 7,199) 3.313 3.349 NS Estimated interviewer variance component, record values of respondents, unit nonresponse only (n = 1,850) 40.555* < 0.01 p = 0.159 Estimated interviewer variance component, Record Values of respondents, unit and item nonresponse (n = 1,833) 44.207* < 0.01 p = 0.135 Estimated interviewer variance component, reported values of respondents (n = 1,833) < 0.01 194.999*** p < 0.01 Estimated intra-interviewer correlation (IIC) in respondent reports < 0.001 0.066 Main source of overall interviewer variance in respondent reports None Measurement error variance SI CI Test of difference in variance components Estimated interviewer variance component, record values of full sample (n = 7,199) 3.313 3.349 NS Estimated interviewer variance component, record values of respondents, unit nonresponse only (n = 1,850) 40.555* < 0.01 p = 0.159 Estimated interviewer variance component, Record Values of respondents, unit and item nonresponse (n = 1,833) 44.207* < 0.01 p = 0.135 Estimated interviewer variance component, reported values of respondents (n = 1,833) < 0.01 194.999*** p < 0.01 Estimated intra-interviewer correlation (IIC) in respondent reports < 0.001 0.066 Main source of overall interviewer variance in respondent reports None Measurement error variance Note.—*p < 0.10, ***p < 0.01, based on a likelihood ratio test; NS = not significant. Table 3. Decomposition of the Total Interviewer Variance in Reports of Longest Uninterrupted Period of Employment in the Past Twenty Years for the SI and CI Groups (see figure 4) SI CI Test of difference in variance components Estimated interviewer variance component, record values of full sample (n = 7,199) 3.313 3.349 NS Estimated interviewer variance component, record values of respondents, unit nonresponse only (n = 1,850) 40.555* < 0.01 p = 0.159 Estimated interviewer variance component, Record Values of respondents, unit and item nonresponse (n = 1,833) 44.207* < 0.01 p = 0.135 Estimated interviewer variance component, reported values of respondents (n = 1,833) < 0.01 194.999*** p < 0.01 Estimated intra-interviewer correlation (IIC) in respondent reports < 0.001 0.066 Main source of overall interviewer variance in respondent reports None Measurement error variance SI CI Test of difference in variance components Estimated interviewer variance component, record values of full sample (n = 7,199) 3.313 3.349 NS Estimated interviewer variance component, record values of respondents, unit nonresponse only (n = 1,850) 40.555* < 0.01 p = 0.159 Estimated interviewer variance component, Record Values of respondents, unit and item nonresponse (n = 1,833) 44.207* < 0.01 p = 0.135 Estimated interviewer variance component, reported values of respondents (n = 1,833) < 0.01 194.999*** p < 0.01 Estimated intra-interviewer correlation (IIC) in respondent reports < 0.001 0.066 Main source of overall interviewer variance in respondent reports None Measurement error variance Note.—*p < 0.10, ***p < 0.01, based on a likelihood ratio test; NS = not significant. Figure 4. View largeDownload slide Box Plots of Predicted Random Interviewer Effects (EBLUPs) at Each Stage of the Data Collection (Where an Interviewer with a Mean Equal to the Overall Mean Would Have an EBLUP of 0), Illustrating the Increased Interviewer Variance in the Longest Uninterrupted Period of Continuous Employment in the Past Twenty Years that Was Introduced in the CI Group (Top Row) at the Measurement Stage. Figure 4. View largeDownload slide Box Plots of Predicted Random Interviewer Effects (EBLUPs) at Each Stage of the Data Collection (Where an Interviewer with a Mean Equal to the Overall Mean Would Have an EBLUP of 0), Illustrating the Increased Interviewer Variance in the Longest Uninterrupted Period of Continuous Employment in the Past Twenty Years that Was Introduced in the CI Group (Top Row) at the Measurement Stage. In summary, given evidence of successful interpenetrated sampling, we found evidence of: Substantial nonresponse error variance among conversational interviewers in the ages of recruited respondents (figure 2); Marginal evidence of measurement error variance among standardized interviewers for reports of annual income (figure 3); Marginal evidence of nonresponse error variance among standardized interviewers for reports of longest uninterrupted employment periods in the past twenty years (which was ultimately reduced at the measurement stage; see figure 4); and Substantial measurement error variance among conversational interviewers in terms of the longest uninterrupted employment periods (figure 4). We also found no evidence of item nonresponse contributing to the overall nonresponse error variance among interviewers. 4. DISCUSSION 4.1 Summary of Results and Suggestions for Practice The results of this study provide empirical support for two main conclusions: Measurement error variance was an important contributor to the total interviewer variance for each interviewing technique, not just CI. Nonresponse error variance among interviewers also occurs for each technique, and in some cases, increases in interviewer variance components at the recruitment stage may ultimately be “offset” by respondent reports that are closer to the overall mean. We found evidence of significant interviewer variance for three out of eleven questions with available administrative data, with measurement error variance tending to be a larger source of the total interviewer variance for two of the three questions. These questions included gross annual income (in 2013) and reports of the longest uninterrupted period of employment in the past twenty years. Interviewers may struggle with correctly communicating the meaning of essential concepts underlying the reporting of gross annual income, including the role of bonuses, overtime payments, etc., and SI may limit the ability of interviewers to correctly explain these ideas. In questions that require a respondent to process a long history of events, less structured approaches like EHC interviewing (e.g., Sayles etal. 2010) may ultimately provide higher quality responses on average. However, these less structured approaches also introduce the possibility of increased interviewer variance if they are applied in an uneven fashion across interviewers. Our results suggest that conversational interviewers may have been taking different approaches to helping respondents think about this history, and some approaches may have ultimately been misleading. Finally, given the clear evidence of nonresponse error variance among interviewers in respondent age found here, several studies have now suggested that interviewers may recruit respondents with different ages (West and Olson 2010; West etal. 2013). This has critical implications for surveys collecting measures that are highly correlated with age, and we discuss these further below. The remaining eight measures that did not present evidence of significant interviewer variance at any step of the analysis generally required respondents to think about more recent events (e.g., employment since January 1, 2013) or historical events that may have been easier to remember because they had a larger impact on a person’s life than simple employment (e.g., number of times registered as unemployed in the last twenty years). It could be the case that these survey questions were somewhat “easier” cognitively and did not introduce as many opportunities for the interviewers to affect the measurement process. In light of these results, strategies for reducing measurement error variance deserve continued attention and consideration. For example, interviews could be audio recorded (with respondent consent), and the recordings of specific interviewers with respondent means that deviate significantly from the overall respondent means in models for particular survey questions5 (see figures 3 and 4, for example) could be analyzed to understand what those interviewers are doing differently. While we monitored audio recordings of completed interviews to ensure that our interviewers were implementing their assigned techniques correctly, this was a resource-intensive process, and we were only able to fully code the interactions in a small sample of completed interviews (Mittereder etal., forthcoming). Consistent analysis and monitoring of the interviewer effects associated with particular survey items would enable survey managers to identify those interviewers producing extreme deviations, and the coding of the audio recordings could then be much more focused on the problematic interviewers and the survey items in question. One-on-one training could then be implemented to correct any issues discovered in the recordings (e.g., failing to probe appropriately or using incorrect definitions of the concepts being measured). Importantly, the nonresponse error variance found in this study was not specific to a given interviewing technique. This is not entirely surprising; while we entertained the possibility that interviewers trained to use conversational interviewing might also vary in the types of individuals that they recruit, one would not generally expect the measurement technique that an organization trains an interviewer to use to affect recruitment outcomes. However, when we consider this finding alongside similar findings in prior studies of this type (West and Olson 2010; West etal. 2013), nonresponse error variance among interviewers does seem to be a general phenomenon that needs careful monitoring (especially in terms of the ages of recruited individuals). Different interviewers may emphasize different aspects of a study design (incentive, topic, sponsor, etc.) during recruitment, and this may be another source of nonresponse error variance, regardless of the interviewing technique being employed. We could not test this here, but this is certainly an open question for future research. Furthermore, the “cancelling out” of nonresponse error variance by respondent reports demonstrated in this study may not be ideal from a data quality perspective: if the mean respondent reports for different interviewers are quite similar but systematically different from the true mean, interviewer variance may be reduced, but the overall survey estimate may be biased.6 In addition, variable errors in the respondent reports could have a negative effect on the quality of estimates of regression coefficients. If the longest uninterrupted period of employment measure was used as a predictor variable in a regression model for some other economic outcome of interest, the absence of apparent interviewer variance in the respondent reports in the SI condition would not tell the whole story about the measurement problems for this variable. Given the evidence of nonresponse error variance in the SI condition, if respondents are providing reports with variable error that tend to be similar to the overall respondent mean and reduce the nonresponse error variance, these variable errors could attenuate estimates of regression coefficients toward zero. In the absence of record values on a sampling frame, survey managers could monitor the variability among interviewers, prior to measurement, in the values of “relevant” auxiliary variables and paradata (i.e., auxiliary information known to be correlated with the key variables) for recruited cases, and possibly intervene if particular interviewers seem to have unusual distributions. We note that this is a different strategy from simply intervening with interviewers who have low response rates. For example, if a survey organization linked the ages of household heads from a commercial database to sampled addresses (e.g., West, Wagner, Gu, and Hubbard 2015), and age was an important correlate of several key items in a given survey (e.g., self-rated health), interviewer variance in the ages of recruited respondents could be analyzed, and managers could intervene if particular interviewers seem to be primarily recruiting persons of a certain age. 4.2 Future Research Directions We first note that the decomposition approach used in this study relied on strong assumptions (e.g., interviewer-specific measurement errors and nonresponse errors are independent). Future research needs to focus on methods for decomposing interviewer variance that require fewer assumptions and allow these error sources to covary. West and Elliott (2015) have proposed one such idea based on multilevel modeling and multiple imputation, and future work in this area needs to further consider more elegant modeling approaches that relax these assumptions. Future studies should also continue to explore the “liking” hypothesis (e.g., older interviewers tend to recruit older respondents, which has important training implications) and examine whether interviewer characteristics effectively account for any observed nonresponse error variance. We only had access to a limited number of interviewer-level covariates in this study [prior experience with the PASS study (yes/no), a related study in Germany collecting similar types of measures (see Trappmann, Beste, Bethmann, and Müller 2013), age, and gender]. The randomized design of this study ensured that these characteristics were balanced between the two interviewing techniques. We added fixed effects of these covariates (with age centered at the mean of 57.26) to the four models fitted to each of the three variables analyzed in this study and found no evidence of the fixed effects being significantly different from zero in any of the models. Many of our interviewers were older and male, which limited the predictive utility of these variables, and several interviewers had the previous experience indicated above. While “liking” was not apparent here, it has been found in other studies (West and Blom 2017) and needs continued consideration. Finally, additional replications of this study would also be welcome for further understanding this problem. The results of this study are specific to one population (employed persons in Germany with complex employment histories), one questionnaire (mainly on employment history), and one data collection organization (infas). This prevents one from predicting whether a replication of this study in a different population and/or context would produce similar findings, motivating future replications. The specific population that we surveyed may weigh the survey participation decision and the topic of our survey quite differently than other populations (e.g., unemployed persons); this population also likely has unique features in terms of age (younger) and education (higher) relative to other populations. These differences have important implications for the nonresponse and measurement errors studied here, in terms of the recruitment process (where timing and the ages of the interviewers may be important factors), the topic of our survey, and the complexity of the questions being asked. We also learned during interviewer training that CI is essentially the norm for face-to-face surveys in Germany, and the interviewers actually found the SI training to be more difficult and unconventional. Interviewers in other countries (such as the United States) who may not be as comfortable with using CI may therefore apply this technique differently, possibly introducing more variance at one of the stages analyzed here, but this is difficult to predict in the absence of any replications. Additional replications of this study in other populations and/or contexts would help to build a set of empirical evidence regarding the interviewer variance introduced by these two techniques across different cultures and survey topics. Footnotes 1. In general, the more clarification that conversational interviewers provide, the longer the interviews take (Schober, Conrad, and Fricker 2004). 2. The overall survey included additional items that were not related to employment history, but we focus on these types of items in this study given the available validation data on the sampling frame. 3. For more information on the IEB database, please visit http://fdz.iab.de/en/FDZ_Individual_Data/Integrated_Employment_Biographies.aspx 4. Estimates of the fixed effects in the final models are available in the online supplementary materials. 5. One could identify interviewers with extreme deviations by computing predicted values of the random interviewer effects (Empirical Best Linear Unbiased Predictors, or EBLUPs) based on the estimated parameters in multilevel models fitted to the survey reports and testing whether the individual random effects are significantly different from zero. This approach is implemented, for example, in PROC MIXED of the SAS/STAT software, by adding the SOLUTION option to the RANDOM statement. 6. Additional analyses reported in West etal. (2017) suggest that this was not the case in this study. The “cancelling out” demonstrated here therefore suggests that interviewers with extreme means based on recruited individuals simply collected responses closer to the overall mean. REFERENCES Ackermann-Piek D. , Massing N. ( 2014 ), “ Interviewer Behavior and Interviewer Characteristics in PIAAC Germany ,” Methods, Data, Analyses , 8 , 199 – 222 . Bailar B. , Bailey L. , Stevens J. ( 1977 ), “ Measures of Interviewer Bias and Variance ,” Journal of Marketing Research , 14 , 337 – 343 . Google Scholar CrossRef Search ADS Belli R. F. ( 1998 ), “ The Structure of Autobiographical Memory and the Event History Calendar: Potential Improvements in the Quality of Retrospective Reports in Surveys ,” Memory , 6 , 383 – 406 . Google Scholar CrossRef Search ADS PubMed Belli R. F. , Lee E. H. , Stafford F. P. , Chou C.-H. ( 2004 ), “ Calendar and Question-list Survey Methods: Association between Interviewer Behaviors and Data Quality ,” Journal of Official Statistics , 20 , 185 – 218 . Belli R. F. , Shay W. L. , Stafford F. P. ( 2001 ), “ Event History Calendars and Question List Surveys: A Direct Comparison of Interviewing Methods ,” Public Opinion Quarterly , 65 , 45 – 74 . Google Scholar CrossRef Search ADS PubMed Bruckmeier K. , Müller G. , Riphahn R. T. ( 2015 ), “ Survey Misreporting of Welfare Receipt: Respondent, Interviewer, and Interview Characteristics ,” Economics Letters , 129 , 103 – 107 . Google Scholar CrossRef Search ADS Campanelli P. , Sturgis P. , Purdon S. ( 1997 ), Can You Hear Me Knocking: An Investigation into the Impact of Interviewers on Survey Response Rates , London : SCPR Survey Methods Centre . Conrad F. G. , Schober M. F. ( 2000 ), “ Clarifying Question Meaning in a Household Telephone Survey ,” Public Opinion Quarterly , 64 , 1 – 28 . Google Scholar CrossRef Search ADS PubMed Elliott M. R. , West B. T. ( 2015 ), “ Clustering by Interviewer’: A Source of Variance That Is Unaccounted for in Single-Stage Health Surveys ,” American Journal of Epidemiology , 182 , 118 – 126 . Google Scholar CrossRef Search ADS PubMed Fowler F. J. , Mangione T. W. ( 1990 ), Standardized Survey Interviewing: Minimizing Interviewer-Related Error , Newbury Park, CA : SAGE Publications . Google Scholar CrossRef Search ADS Groves R. M. ( 2004 ), “Chapter 8: The interviewer as a source of survey measurement error,” Survey Errors and Survey Costs (2nd ed.) , New York : Wiley-Interscience . Groves R. M. ( 2006 ), “ Nonresponse Rates and Nonresponse Bias in Household Surveys ,” Public Opinion Quarterly , 70 , 646 – 675 . Google Scholar CrossRef Search ADS Groves R. M. , Magilavy L. J. ( 1986 ), “ Measuring and Explaining Interviewer Effects in Centralized Telephone Surveys ,” Public Opinion Quarterly , 50 , 251 – 266 . Google Scholar CrossRef Search ADS Groves R. M. , McGonagle K. A. ( 2001 ), “ A Theory-Guided Interviewer Training Protocol Regarding Survey Participation ,” Journal of Official Statistics , 17 , 249 – 265 . Groves R. M. , Cialdini R. B. , Couper M. P. ( 1992 ), “ Understanding the Decision to Participate in a Survey ,” Public Opinion Quarterly , 56 , 475 – 495 . Google Scholar CrossRef Search ADS Groves R. M. , Peytcheva E. ( 2008 ), “ The Impact of Nonresponse Rates on Nonresponse Bias: A Meta-Analysis ,” Public Opinion Quarterly , 72 , 167 – 189 . Google Scholar CrossRef Search ADS Groves R. M. , Singer E. , Corning A. ( 2000 ), “ Leverage-Saliency Theory of Survey Participation ,” Public Opinion Quarterly , 64 , 299 – 308 . Google Scholar CrossRef Search ADS PubMed Haan M. , Ongena Y. , Huiskes M. ( 2013 ), “Interviewers’ Questions: Rewording Not Always a Bad Thing,” In: Interviewers' Deviations in Surveys: Impact, Reasons, Detection and Prevention , (edited by Winker P. , Menold N. , Porst R. ), Frankfurt : Peter Lang Academic Research . Henson R. , Cannell C. F. , Lawson S. ( 1976 ), “ Effects of Interviewer Style on Quality of Reporting in a Survey Interview ,” Journal of Psychology , 93 , 221 – 227 . Google Scholar CrossRef Search ADS Houtkoop-Steenstra H. ( 1995 ), “Meeting Both Ends: Between Standardization and Recipient Design in Telephone Survey Interviews,” Situated Order: Studies in the Social Organization of Talk and Embodied Activities , ed. ten Have P. , Psathas G. , pp. 91 – 107, Washington, DC : University Press of America . Hubbard F. , Antoun C. , Conrad F. G. ( 2012 ), “Conversational Interviewing, The Comprehension of Opinion Questions and Nonverbal Sensitivity,” Paper presented at the Annual Conference of the American Association for Public Opinion Research, Orlando, FL. Kish L. ( 1962 ), “ Studies of Interviewer Variance for Attitudinal Variables ,” Journal of the American Statistical Association , 57 , 92 – 115 . Google Scholar CrossRef Search ADS Mangione T. W. , Fowler F. J. , Louis T. A. ( 1992 ), “ Question Characteristics and Interviewer Effects ,” Journal of Official Statistics , 8 , 293 – 307 . Mittereder F. , Durow J. , West B. T. , Kreuter F. , Conrad F. G. (forthcoming), “ Interviewer-Respondent Interactions in Conversational and Standardized Interviewing ,” Field Methods . Morton-Williams J. ( 1993 ), Interviewer Approaches , Aldershot : Dartmouth Publishing Company Limited . Moser C. A. , Stuart A. ( 1953 ), “ An Experimental Study of Quota Sampling ,” Journal of the Royal Statistical Society, Series A , 116 , 349 – 405 . Google Scholar CrossRef Search ADS O’Muircheartaigh C. , Campanelli P. ( 1998 ), “ The Relative Impact of Interviewer Effects and Sample Design Effects on Survey Precision ,” Journal of the Royal Statistical Society, Series A , 161 , 63 – 77 . Google Scholar CrossRef Search ADS Peneff J. ( 1988 ), “ The Observers Observed: French Survey Researchers at Work ,” Social Problems , 35 , 520 – 535 . Google Scholar CrossRef Search ADS Sayles H. , Belli R. F. , Serrano E. ( 2010 ), “ Interviewer Variance Between Event History Calendar and Conventional Questionnaire Interviews ,” Public Opinion Quarterly , 74 , 140 – 153 . Google Scholar CrossRef Search ADS Schaeffer N. C. , Dykema J. , Maynard D. W. ( 2010 ), “Interviewers and Interviewing,” Handbook of Survey Research , ( 2nd ed .), ed. Wright James D. , Marsden Peter V. , pp. 437 – 470 , Bingley, UK : Emerald Group Publishing Limited . Schnell R. , Kreuter F. ( 2005 ), “ Separating Interviewer and Sampling-Point Effects ,” Journal of Official Statistics , 21 , 389 – 410 . Schober M. F. , Conrad F. G. ( 1997 ), “ Does Conversational Interviewing Reduce Survey Measurement Error? ,” Public Opinion Quarterly , 61 , 576 – 602 . Google Scholar CrossRef Search ADS Schober M. F. , Conrad F. G. , Dijkstra W. , Ongena Y. P. ( 2012 ), “ Disfluencies and Gaze Aversion in Unreliable Responses to Survey Questions ,” Journal of Official Statistics , 28 , 555 – 582 . Schober M. F. , Conrad F. G. , Fricker S. S. ( 2004 ), “ Misunderstanding Standardized Language in Research Interviews ,” Applied Cognitive Psychology , 18 , 169 – 188 . Google Scholar CrossRef Search ADS Snijkers G. , Hox J. J. , de Leeuw E. D. ( 1999 ), “ Interviewers' Tactics for Fighting Survey Nonresponse ,” Journal of Official Statistics , 15 , 185 – 198 . Stock J. S. , Hochstim J. R. ( 1951 ), “ A Method of Measuring Interviewer Variability ,” Public Opinion Quarterly , 15 , 322 – 334 . Google Scholar CrossRef Search ADS Stokes S. L. , Yeh M . ( 1988 ), “Searching for Causes of Interviewer Effects in Telephone Surveys,” Telephone Survey Methodology , ed. Groves R. M. , pp. 357 – 373, New York : John Wiley and Sons . Sturgis P. , Campanelli P. ( 1998 ), “ The Scope for Reducing Refusals in Household Surveys: An Invesigation Based on Transcripts of Tape-Recorded Doorstep Interactions ,” Journal of the Market Research Society , 40 , 121 – 139 . Suchman L. , Jordan B. ( 1990 ), “ Interactional Troubles in Face-to-Face Survey Interviews ,” Journal of the American Statistical Association , 85 , 232 – 241 . Google Scholar CrossRef Search ADS Trappmann M. , Beste J. , Bethmann A. , Müller G. ( 2013 ), “ The PASS Panel Survey After Six Waves ,” Journal of Labor Market Research , 46 , 275 – 281 . Google Scholar CrossRef Search ADS Tucker C. ( 1983 ), “ Interviewer Effects in Telephone Surveys ,” Public Opinion Quarterly , 47 , 84 – 95 . Google Scholar CrossRef Search ADS Vassallo R. , Durrant G. , Smith P. ( 2016 ), “ Separating Interviewer and Area Effects by Using a Cross-Classified Multilevel Logistic Model: Simulation Findings and Implications for Survey Designs ,” Journal of the Royal Statistical Society, Series A . doi: 10.1111/rssa.12206. West B. T. , Blom A. G. ( 2017 ), “ Explaining Interviewer Effects: A Research Synthesis ,” Journal of Survey Statistics and Methodology , 5 , 175 – 211 . West B. T. , Conrad F. G. , Kreuter F. , Mittereder F. ( 2017 ), “ Can Conversational Interviewing Improve Survey Response Quality Without Increasing Interviewer Effects? ,” Journal of the Royal Statistical Society, Series A . doi: 10.1111/rssa.12255. West B. T. , Elliott M. R. ( 2014 ), “ Frequentist and Bayesian Approaches for Comparing Interviewer Variance Components in Two Groups of Survey Interviewers ,” Survey Methodology , 40 , 163 – 188 . West B. T. , Elliott M. R. ( 2015 ), “New Methodologies for the Study and Decomposition of Interviewer Effects in Surveys,” Paper Presented at the Annual Meeting of the Statistical Society of Canada (SSC), Halifax, Nova Scotia, 6/17/15. West B. T. , Kreuter F. , Jaenichen U. ( 2013 ). “ Interviewer’ Effects in Face-to-face Surveys: A Function of Sampling, Measurement Error or Nonresponse? ,” Journal of Official Statistics , 29 , 277 – 297 . Google Scholar CrossRef Search ADS West B. T. , Olson K. ( 2010 ), “ How Much of Interviewer Variance is Really Nonresponse Error Variance? ,” Public Opinion Quarterly , 74 , 1004 – 1026 . Google Scholar CrossRef Search ADS West B. T. , Wagner J. , Gu H. , Hubbard F. ( 2015 ), “ The Utility of Alternative Commercial Data Sources for Survey Operations and Estimation: Evidence from the National Survey of Family Growth ,” Journal of Survey Statistics and Methodology , 3 , 240 – 264 . Google Scholar CrossRef Search ADS Zhang D. , Lin X. ( 2008 ), “Variance Component Testing in Generalized Linear Mixed Models for Longitudinal/Clustered Data and Other Related Topics,” Random Effect and Latent Variable Model Selection , ed. Dunson D. B. , Lecture Notes in Statistics, 192 , New York, NY : Springer . © The Author 2017. Published by Oxford University Press on behalf of the American Association for Public Opinion Research. All rights reserved. For Permissions, please email: journals.permissions@oup.com

Journal

Journal of Survey Statistics and MethodologyOxford University Press

Published: Sep 29, 2017

There are no references for this article.

You’re reading a free preview. Subscribe to read the entire article.


DeepDyve is your
personal research library

It’s your single place to instantly
discover and read the research
that matters to you.

Enjoy affordable access to
over 18 million articles from more than
15,000 peer-reviewed journals.

All for just $49/month

Explore the DeepDyve Library

Search

Query the DeepDyve database, plus search all of PubMed and Google Scholar seamlessly

Organize

Save any article or search result from DeepDyve, PubMed, and Google Scholar... all in one place.

Access

Get unlimited, online access to over 18 million full-text articles from more than 15,000 scientific journals.

Your journals are on DeepDyve

Read from thousands of the leading scholarly journals from SpringerNature, Elsevier, Wiley-Blackwell, Oxford University Press and more.

All the latest content is available, no embargo periods.

See the journals in your area

DeepDyve

Freelancer

DeepDyve

Pro

Price

FREE

$49/month
$360/year

Save searches from
Google Scholar,
PubMed

Create lists to
organize your research

Export lists, citations

Read DeepDyve articles

Abstract access only

Unlimited access to over
18 million full-text articles

Print

20 pages / month

PDF Discount

20% off