A Method for Accounting for Classification Error in a Stratified Cellphone Sample

Abstract

State-based telephone surveys are often designed to make estimates at substate levels, such as county or county group. Under a traditional random-digit-dial design, the telephone exchange of a landline number could be used to accurately identify the county in which the associated household resides. However, initially, no good analogous data methods existed for the cellphone frame. This required survey methodologists to draw random samples of cellphone numbers from the entire state, making it difficult to target areas within a state. To overcome this shortcoming, sample vendors have used a cellphone number’s rate center (where the number was activated) as a proxy for the county where the cellphone owner resides. Our paper shows that county assignments based on rate center data may have classification error rates as high as 30%. These high classification error rates make it difficult to accurately devise a cellphone frame sample allocation using the rate center data. This paper proposes a new method—the Rate Center Plus method—which uses rate centers and an estimate of the classification probabilities to stratify and allocate the desired respondent sample to counties. The new method uses Bayes’ rule to distribute a desired county-level sample allocation across rate center counties. We demonstrate how the Rate Center Plus method was applied to the 2015 Ohio Medicaid Assessment Survey and the resulting efficacy of the method. Finally, we evaluate whether the new approach is more efficient than the traditional statewide sample method. In addition, we look at four approaches to estimating the necessary classification probabilities. We found that the Rate Center Plus method can be more cost efficient than the statewide sample method when the classification probabilities are reasonably estimated, reducing data collection costs by as much as 12.8%.

1. INTRODUCTION

State- or local-area-based surveys are often designed to make estimates at the county or county-group levels (e.g., Ohio Medicaid Assessment Survey [OMAS] 2012; California Health Interview Survey 2014). Under a traditional random-digit-dial (RDD) design, the telephone exchange of a landline number could be used to accurately identify the county in which the associated household resides. However, initially, no good analogous data methods existed for the cellphone frame. This required survey methodologists to draw random samples of cellphone numbers from the entire state, making it difficult to target areas within a state. When the proportion of the sample allocated to the cellphone frame was relatively small and used mainly to ensure full coverage of the target population, allocating the cellphone sample at the county level was not necessary. However, as the proportion of the population that is cellphone-only or cellphone-mostly increases (National Health Interview Survey 2016), especially among young adults with children and minorities (Lu, Berzofsky, Sahr, Ferketich, Blanton, et al. 2014), it is necessary to increase cellphone allocations to offset any loss in precision because of increased design effects (Peytchev and Neely 2013). A mismatch between where the cellphone respondent actually lives and where the survey design believes they live can cause an increase in variance (Skalland and Khare 2013).
To allow cellphone samples to target substate areas better, sample vendors, such as Marketing Systems Group (MSG), have recently been able to identify a cellphone number’s rate center and determine the county in which the rate center most likely resides. A rate center is a geographic area used by a local exchange carrier (telephone company) to set boundaries for local calling, billing, and assigning telephone numbers. When a person activates a cellphone, the sample vendors link the rate center in which the activation occurs to the cellphone number. The size of a rate center varies by geographic area, so the number of rate centers that fall within a county’s boundaries varies by county. For example, Cuyahoga County has fifteen rate centers, while Hamilton County has only two. Rate centers can be clustered into rate center counties based on the county in which most of a rate center falls (e.g., the fifteen rate centers in Cuyahoga County, Ohio). If the rate center county assignment accurately predicts where the cellphone user resides, then it can be used in the same way exchanges are used on the landline frame. However, if there is classification error between where a person lives and their assigned rate center county, methods are required to increase the likelihood that the desired substate allocation is achieved. Combining rate centers into rate center counties can minimize the amount of classification error, but this step is not necessary; smaller geographic areas can be formed down to the rate center level. In surveys that sample rate centers near a state border, rate center classification errors have been found (e.g., Kafka, Chattopadhyay, and Chan 2015).

2. MOTIVATION

State surveys often have analytic objectives that require allocations that deviate from a simple random sample (SRS). Reasons for these alternate allocations include (1) oversampling certain areas to increase the respondent sample size of important subpopulations, ensuring a minimum level of precision for substate estimates, and (2) minimizing the impact of differential sampling rates on the estimates at the county level. Optimizing a sample allocation based on an objective function is discussed at length in both the survey research literature (e.g., Peytchev and Neely 2013; Levine and Harter 2015) and the operations research literature (e.g., Calinescu, Bhulai, and Schouten 2013). Regardless of the optimization goal, the ability to achieve the desired allocation depends on the ability to identify accurately on the frame where a respondent is geographically located. If the substate geographic identifier has error, the efficiency of the design—both statistical and cost efficiency—will be reduced. However, if the classification error can be estimated and accurately incorporated into the design, the survey can achieve its optimization objective. In this paper, we develop a new method that accounts for classification errors in the sampling frame, which, if left unaddressed, would reduce the statistical and cost efficiency of the optimization approach used in the study design. Although the rate center county is not a perfect proxy for county of residence, it is the best measure currently available on the cellphone frame. Therefore, we propose the Rate Center Plus method to account for the measurable imperfections in the linkage between rate center county and county of residence.
The Rate Center Plus method uses auxiliary data to estimate and account for the classification error in the rate center county assignments so that the cellphone sample can be properly allocated. Through this process, we aim to show how the efficiency with which the desired allocation across substate areas is achieved can be improved. Ultimately, our paper has two goals: (1) develop an empirical methodology to improve the accuracy and efficiency of allocating a targeted sample across sampling strata—which represent proxies for geographic areas—to achieve a desired number of interviews in each area, and (2) show under what conditions the developed empirical approach works. Given the constraints that need to be met for such an allocation, an exact mathematical solution may not be possible without solutions for some areas existing only on the boundaries of a domain (e.g., negative values), which, in practical terms, can lead to unacceptable allocations. Therefore, our approach, using relatively inexpensive auxiliary data, seeks to minimize the difference between the targeted allocation and the achieved allocation. To assess the conditions under which our method produces optimal results, we conduct two assessments. First, we use information obtained from the 2012 OMAS to estimate the classification error between rate center county and county of residence and apply it to the 2015 OMAS. Second, we evaluate five sources of auxiliary information, including sources that can be used during a survey’s first administration, to determine which perform well and which yield inaccurate final respondent allocations.

3. RATE CENTER PLUS METHOD

3.1 Approach

The Rate Center Plus method uses Bayes’ rule to achieve an accurate allocation of a sample across a set of smaller geographic units (e.g., counties within a state) when only a set of proxy geographic units (e.g., rate center counties) with unknown misclassification rates (or classification error rates) is available on the frame for stratification. The allocation relies on estimated classification probabilities between a cellphone number’s assigned rate center and the actual county in which the owner of the cellphone resides. Several approaches can be used to estimate the classification probabilities, and each will result in a different allocation of the desired number of completed interviews for a county across rate center counties. As such, the efficiency with which the county-level allocation is achieved will vary across approaches. These approaches include using prior survey data, a small targeted address sample, billing ZIP code data, or the distribution of 1,000-banks, or assuming a proportional allocation. Section 5 presents an evaluation of these alternative methods for computing the classification errors. In general, the Rate Center Plus method can be completed in five basic steps (a worked sketch follows the example in section 3.2.2):

1. For each of $K$ counties, determine the misclassification rates associated with a cellphone number’s rate center county (RCC) assignment given the actual county (AC) in which the respondent resides. Compute these misclassification rates via the conditional probability that a responding cellphone number will be assigned to RCC $k'$ given the AC $k$ in which the cellphone respondent resides (i.e., $p_{k'|k} = P(\mathrm{RCC}_{k'} \mid \mathrm{AC}_k)$ for $k = 1, \ldots, K$ and $k' = 1, \ldots, K$). See section 5 for an evaluation of approaches for estimating this probability.

2. Using the misclassification rates developed in step one, create a $K \times K$ probability matrix.
3. Determine the desired number of completed interviews in each subarea (i.e., county), $n_k$. This distribution takes into account any unequal allocation of sampling units (e.g., oversampling in certain counties or areas) required to achieve study objectives.

4. Using Bayes’ rule, apply the desired allocation from step three to the probability matrix from step two to obtain the expected number of respondents in actual county $k$ given that the respondent’s cellphone was assigned to rate center county $k'$ (i.e., $n_{k|k'} = n_k \times p_{k'|k}$).

5. Obtain the number of expected respondents per rate center county by summing the expected conditional sample allocation from step four over actual counties [1] (i.e., $n_{k'} = \sum_{k=1}^{K} n_{k|k'}$ for $k' = 1, \ldots, K$).

3.2 Simple Example of Rate Center Plus Method

To illustrate the proposed approach, we (1) look at the relationship between classification error rates and the matrix probabilities for a county and (2) present a simple example.

3.2.1 Different types of counties

To get a sense of how accurately rate center county predicts a cellphone respondent’s county of residence, we had the rate center county appended to the final set of 2012 OMAS respondents (where a statewide SRS of cellphone numbers was selected) and compared it with the reported county of residence. Figure 1 shows the distribution of respondents based on their assigned rate center county and their survey response by county and county type (metro, suburban, rural non-Appalachian, and rural Appalachian). In metro counties, the rate center county assignment overestimates the number of residents, but for other county types, the rate center county underestimates the number of residents. Treating the reported county of residence as the truth, we find that (1) metro counties have a higher false-positive rate and (2) nonmetro counties have a higher false-negative rate.

Figure 1. 2012 Rate Center County Versus Actual County of Residence, by County Type.

To illustrate the impact that these differing classification error rates have on our proposed method, figures 2 and 3 show the distribution of the respondents in the 2012 OMAS from Hamilton County (a metro county) and Coshocton County (a rural county) in Ohio.

Figure 2. Distribution of Hamilton County Respondents by Rate Center County, 2012 OMAS.

Figure 3. Distribution of Coshocton County Respondents by Rate Center County, 2012 OMAS.

The rate center county for Hamilton County has a 4.0% false-positive rate and a 7.2% false-negative rate, whereas the rate center county for Coshocton County has a false-positive rate of only 0.1% and a false-negative rate of 32.5%. For Hamilton County, the false-negative rate, relatively small when compared with Coshocton County’s rate, means that very few cellphone numbers assigned to counties other than the Hamilton rate center county belong to Ohio residents who live inside Hamilton. Therefore, to obtain residents of Hamilton County, one needs to draw numbers largely from the Hamilton County rate center. The relatively large false-positive rate means that some cellphone numbers assigned to the Hamilton rate center belong to residents of other counties.
In this example, interviews for the surrounding counties will come largely from the Hamilton rate center. This finding is representative of the other metro counties in Ohio. Therefore, an oversample of telephone numbers needs to be selected from metro county rate centers because they will be the main supplier of interviews for both metro and nonmetro counties. Unlike with Hamilton County, the relatively large false-negative rate in Coshocton County means that a significant portion of the cellphone numbers belonging to Coshocton residents are assigned to rate center counties other than Coshocton. The relatively small false-positive rate means that cellphone numbers assigned to the Coshocton rate center almost always belong to Coshocton residents. Therefore, cellphone numbers selected in the Coshocton rate center are highly likely to belong to residents of Coshocton. These findings are representative of other nonmetro counties. Therefore, an undersample of cellphone numbers needs to be drawn from nonmetro rate center counties because a large portion of nonmetro county interviews will come from other rate center counties.

3.2.2 Simple example

We fully illustrate the Rate Center Plus method with a sample design for a hypothetical state with five counties and a target of four hundred completed interviews across those five counties. County 1 and County 4 are metro counties; County 2, County 3, and County 5 are nonmetro counties. Table 1 presents the allocation of the four hundred total interviews across the five counties ($n_k$), the conditional probability of a respondent’s cellphone number being assigned to a rate center county given the actual county in which the respondent resides ($p_{k'|k}$), and the resulting sample size to be selected from each rate center ($n_{k'}$). These numbers correspond with steps one through five described in section 3.1.

Table 1. Desired Number of Interviews, Probability of Association with Rate Center County, and Rate Center County Sample Size for a State with Five Counties. Each cell shows $p_{k'|k}$ (row percentages sum to 100%) with the implied expected interviews $n_{k|k'}$ in parentheses; the bottom row gives the rate center allocation $n_{k'}$.

Actual county | Desired n | RC County 1 | RC County 2 | RC County 3 | RC County 4 | RC County 5 | Total
County 1      | 100       | 80% (80)    | 5% (5)      | 0% (0)      | 1% (1)      | 14% (14)    | 100%
County 2      | 50        | 35% (17.5)  | 40% (20)    | 10% (5)     | 15% (7.5)   | 0% (0)      | 100%
County 3      | 80        | 3% (2.4)    | 15% (12)    | 45% (36)    | 30% (24)    | 7% (5.6)    | 100%
County 4      | 140       | 0% (0)      | 2% (2.8)    | 2% (2.8)    | 95% (133)   | 1% (1.4)    | 100%
County 5      | 30        | 10% (3)     | 5% (1.5)    | 5% (1.5)    | 60% (18)    | 20% (6)     | 100%
Sample size   | 400       | 103         | 41          | 45          | 184         | 27          |

For County 1, a metro county, table 1 shows that the design targets one hundred interviews. To obtain those interviews, in expectation, $n_{1|1}$ is eighty for the County 1 rate center ($p_{1|1} = 80\%$), five for the County 2 rate center, zero for the County 3 rate center, one for the County 4 rate center, and fourteen for the County 5 rate center. However, because County 1 is a metro county, numbers assigned to its rate center will, in expectation, supply $n_{2|1} = 17.5$ interviews to County 2, $n_{3|1} = 2.4$ interviews to County 3, $n_{4|1} = 0$ interviews to County 4, and $n_{5|1} = 3$ interviews to County 5. Therefore, the sample allocated to the County 1 rate center will be large enough to obtain 103 interviews. By contrast, County 3, a nonmetro county, has a desired $n_3 = 80$ completed interviews but will have only $n_{3|3} = 36$ interviews allocated to the County 3 rate center, because many of its interviews will come from other rate center counties, including $n_{3|2} = 12$ from the County 2 rate center and $n_{3|4} = 24$ from the County 4 rate center.
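To make the computation concrete, the following minimal sketch (Python with NumPy) reproduces steps four and five for the table 1 inputs. The array names are illustrative, not from the paper.

```python
import numpy as np

# Desired completed interviews per actual county (n_k), from table 1.
n_k = np.array([100., 50., 80., 140., 30.])

# p_{k'|k}: row k = actual county, column k' = rate center county;
# each row sums to 1.
P = np.array([
    [0.80, 0.05, 0.00, 0.01, 0.14],
    [0.35, 0.40, 0.10, 0.15, 0.00],
    [0.03, 0.15, 0.45, 0.30, 0.07],
    [0.00, 0.02, 0.02, 0.95, 0.01],
    [0.10, 0.05, 0.05, 0.60, 0.20],
])

# Step four: n_{k|k'} = n_k * p_{k'|k}, the expected interviews for actual
# county k that come from rate center county k'.
n_k_given_rc = n_k[:, None] * P

# Step five: summing over actual counties k gives the allocation per rate
# center county, n_{k'}.
n_rc = n_k_given_rc.sum(axis=0)
print(np.round(n_rc))  # approximately [103, 41, 45, 184, 27], as in table 1
```

The column sums are the rate center allocations that, in expectation, return the desired county totals.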
4. APPLICATION TO THE 2015 OHIO MEDICAID ASSESSMENT SURVEY

In applying the process to the 2015 OMAS, we had two main goals: (1) apply the proposed allocation approach to understand any issues or complications with the process and (2) assess how well the process worked in terms of its key assumptions and how well the final distribution of respondents matched the desired distribution.

4.1 Ohio Medicaid Assessment Survey

The OMAS is a periodic survey of residents in Ohio. OMAS measures health insurance coverage and access to medical services among adults and children. Because the outcomes of interest are highly correlated with where a person lives, county-level estimates are critical to understanding the populations at greatest risk. Since its inception in 2004, OMAS has used an RDD telephone design. Beginning in 2008, OMAS moved to a dual-frame design but with a limited cellphone sample (approximately 5% of the total sample). In 2012, OMAS increased the proportion of desired interviews from the cellphone sample to 25.0% (Ohio Medicaid Assessment Survey 2012). The 2012 cellphone sample was a statewide random sample of cellphone numbers. At the time of the 2015 survey, according to the National Health Interview Survey (2016), 65.0% of adults and 75.2% of children lived in a cellphone-only or cellphone-mostly household. Moreover, data from the 2012 OMAS indicated that minorities, households with children, and residents at or around the poverty level were more likely to be contacted through the cellphone frame (Lu et al. 2014). On the basis of these findings, the cellphone allocation was increased to 55.0% of completed interviews. Furthermore, 40,000 total interviews (cellphone and landline combined) were targeted for the 2015 OMAS, with the objective of producing direct estimates at the county level in the majority of Ohio’s eighty-eight counties.

4.2 Sample Allocation

Using 2012 OMAS data, we implemented our proposed approach. For step one, we calculated $p_{k'|k}$ using the approximately 5,000 completed cellphone interviews conducted during the 2012 survey, which used an SRS to select cellphone numbers across the state. After data collection, the rate center county associated with each responding telephone number was obtained to compute the conditional probabilities (a sketch of this computation follows this section). It is worth noting that three counties in Ohio do not have any rate centers, which means that all of the desired interviews from those counties need to come from other counties (i.e., all interviews depend on the false-negative rates of other rate center counties). For step two, we created a 90 × 90 matrix. Ohio has eighty-eight counties, but we split the two most urban counties—Cuyahoga County, where Cleveland is located, and Franklin County, where Columbus is located—in two based on rate centers identified as having a higher density of African Americans, in order to oversample this population. One issue we encountered was that some counties had too few cellphone respondents in 2012 (fewer than twenty) to obtain reliable classification errors and distributions of respondents across rate center counties. When this occurred, we collapsed these counties with neighboring counties of the same county type (e.g., metro, suburban, rural Appalachian, or rural non-Appalachian) to develop combined probabilities. For the probability matrix, twenty-six counties required a combined probability. The combined probabilities were assigned to the counties with twenty or fewer cellphone respondents in 2012. [2] For step three, we initially allocated the desired 26,000 cellphone respondents proportionally across the ninety strata (i.e., $n_k$). Adjustments were made to the allocation to ensure a minimum of forty-five completed interviews in each stratum. The desired sample sizes of the two high-density African American strata were increased to account for the oversample. See Berzofsky, Lu, Weston, Couzens, and Sahr (2015) for more details on the sample allocation. For step four, given the allocation of the 26,000 desired cellphone respondents across the ninety strata from step three and the probability matrix from step two, we determined the desired number of interviews within each stratum by rate center stratum combination (i.e., $n_{k|k'}$). For step five, within each rate center stratum, we summed across all ninety county strata to obtain the desired number of interviews within each rate center stratum (i.e., $n_{k'}$). Figure 4 presents the final allocation across rate center strata compared with the desired number of interviews in each county stratum. Strata with a dot on the forty-five degree line have exactly the same desired number of interviews in the rate center stratum as in the county stratum. As expected, metro counties/strata will, for the most part, obtain more interviews in the rate center stratum than in the actual county. For example, Butler County has almost 2,500 interviews drawn from its rate center county but desires only about 1,250 interviews from residents of that county. In total, thirty-eight strata have a larger desired sample in the rate center stratum than in their equivalent county stratum, and fifty strata have a smaller desired sample in the rate center stratum than in their equivalent county stratum.

Figure 4. Sample Size Drawn Versus Target Sample Desired by County Type for OMAS 2015.
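The step-one computation amounts to a cross-tabulation of a prior wave’s respondents. The following sketch (Python with pandas) uses toy data; the column names and the collapsing rule are assumptions standing in for the OMAS respondent file.

```python
import pandas as pd

# Hypothetical prior-wave respondent records: self-reported county of
# residence and the vendor-appended rate center county (toy data).
resp = pd.DataFrame({
    "actual_county":      ["A", "A", "A", "B", "B", "C", "C", "C"],
    "rate_center_county": ["A", "A", "B", "B", "A", "C", "B", "C"],
})

# Cross-tabulate: row k = actual county, column k' = rate center county.
counts = pd.crosstab(resp["actual_county"], resp["rate_center_county"])

# p_{k'|k}: within each actual county, the share of respondents whose
# number maps to each rate center county (rows sum to 1).
p = counts.div(counts.sum(axis=1), axis=0)
print(p)

# Rows built from very few respondents are unstable; flag them for
# collapsing with neighboring counties of the same type, as in section 4.2
# (which used a floor of roughly twenty respondents).
too_small = counts.sum(axis=1) < 20
```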
4.3 Assessing the Performance of the Approach

The assessment of how well the Rate Center Plus method performed for the 2015 OMAS sample consisted of two components: (1) assessing the assumptions of the method and (2) comparing the final distribution of respondents with the desired distribution of respondents based on the proposed approach.

4.3.1 Method assumptions

Methods such as the Rate Center Plus method usually require a set of assumptions to be valid. To assess the efficacy of using prior survey data to develop the classification probabilities, two assumptions needed to be met:

Assumption 1: The distribution of cellphone users by county (or substate area) in the current survey period is similar to the distribution of cellphone users in the period used to develop the classification probability matrix.

Assumption 2: The classification error rates for each rate center county (or substate area) during the current survey period are similar to the classification error rates in the period used to produce the classification probability matrix.

Assumption 1 is needed to help ensure that the distribution of where a county’s respondents will be found across rate center counties is similar. If the distribution of cellphone users has changed from when the classification probabilities were derived to when the survey was conducted, the distribution of where residents live according to rate center county has likely changed as well. Assumption 2 is needed to help ensure that the classification probability matrix used for the allocation across rate center counties will produce the desired number of interviews in each county. For example, if a rate center county has a high false-positive rate in the initial period but not in the current period, too many interviews will be allocated to that rate center county (i.e., the actual county will obtain more interviews than desired while other counties obtain fewer). Similarly, if the false-negative rate for a county has changed, the allocation assumption regarding how many interviews for the county will be obtained from other rate center counties will not hold, leading to fewer interviews than desired in the county. For surveys that have had two or more prior iterations, this assumption can be examined prior to the current study period. However, because Assumption 2 is based on the respondent pool, changes in eligibility criteria or nonresponse patterns may affect this assumption over time.

To assess Assumption 1, we compared the final distribution of cellphone users by county (i.e., the distribution of cellphone users taking into account their final survey weight, which adjusts for unequal probabilities of selection) in 2012 and 2015. Using respondent data from the 2012 and 2015 OMAS, figure 5 presents the estimated distribution of the cellphone population in Ohio by county in 2012 and 2015. As the figure shows, most of the eighty-eight counties align closely across the two periods. This indicates that Assumption 1 held and that the probability matrix used for the allocation was valid for the 2015 survey. A minimal version of this check is sketched below.

Figure 5. Estimated Distribution of Cellphone Users in Ohio in 2012 and 2015 by County.
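The Assumption 1 check amounts to comparing weighted county shares across waves. A minimal sketch (Python with pandas) with toy stand-ins for the two waves; the column names ('county', 'weight') are assumptions, not the OMAS layout.

```python
import pandas as pd

def county_shares(df: pd.DataFrame) -> pd.Series:
    # Weighted share of the state's cellphone users living in each county.
    totals = df.groupby("county")["weight"].sum()
    return totals / totals.sum()

# Toy respondent files for two waves (one row per respondent).
wave_1 = pd.DataFrame({"county": ["A", "A", "B", "C"], "weight": [2.0, 1.0, 3.0, 2.0]})
wave_2 = pd.DataFrame({"county": ["A", "B", "B", "C"], "weight": [2.5, 1.5, 2.0, 2.0]})

shares = pd.concat(
    {"wave 1": county_shares(wave_1), "wave 2": county_shares(wave_2)}, axis=1
)
# Large gaps flag counties where Assumption 1 may not hold.
shares["abs_diff"] = (shares["wave 1"] - shares["wave 2"]).abs()
print(shares.sort_values("abs_diff", ascending=False))
```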
To assess Assumption 2, we compared the false-positive and false-negative rates in each county type between 2012 and 2015. Table 2 presents the classification error rates by county type during the two periods. As the table shows, the classification error rates in each county type were reasonably consistent over time. This indicates that Assumption 2 holds and that the number of cellphone numbers allocated to a rate center county will appropriately result in the desired number of interviews within each actual county. Furthermore, we compared the classification error rates at the county level. We found that, while there is variability across county types, the classification error rates were similar across the two time periods.

Table 2. False-Positive and False-Negative Rates of Rate Center Accuracy in 2012 and 2015 by County Type

County type           | False-positive rate, 2012 | False-positive rate, 2015 | False-negative rate, 2012 | False-negative rate, 2015
Metro                 | 2.7%                      | 2.1%                      | 17.0%                     | 21.8%
Rural Appalachian     | 0.1%                      | 0.2%                      | 38.6%                     | 41.4%
Rural non-Appalachian | 0.2%                      | 0.2%                      | 44.2%                     | 47.2%
Suburban              | 0.3%                      | 0.4%                      | 58.8%                     | 60.1%

When using prior survey data, an additional key criterion is that enough prior survey data are available in each county. As noted in our application, we collapsed counties that had fewer than twenty respondents in the 2012 OMAS. To ensure that collapsing counties for the purpose of estimating the classification probabilities does not diminish their accuracy, one can examine the characteristics of persons in the counties being collapsed. In our case, we determined that counties within geographic proximity of each other and of the same type (e.g., metro, suburban, rural) had similar classification probabilities. If counties cannot be collapsed, an approach for estimating classification probabilities that does not require prior survey data may need to be considered (see section 5 for a discussion and evaluation of these approaches).

4.3.2 Final distribution of completed interviews

Figure 6 shows the actual number of completed interviews in each stratum compared with the desired number of completed interviews. As can be seen, the distribution tracks fairly closely with the forty-five degree line. In fact, in 55.0% of strata, we achieved or exceeded our desired number of interviews. As the breakout in figure 6 of strata with a target of fewer than 600 interviews shows, forty counties fell short of their target, but most of those counties were still very close to their desired goal—twenty of these counties were between 80% and 99% of their target, while only seven received more than twice or less than half of their target.

Figure 6. Actual Number of Completed Interviews Versus Desired Number of Completed Interviews for All Strata and Strata with an Expected Number of 600 or Fewer in OMAS 2015.
5. EVALUATION OF THE EFFICIENCY OF THE RATE CENTER PLUS METHOD

The Rate Center Plus method aims to improve a survey designer’s ability to target and allocate samples to geographic areas below the state level. As such, the method is beneficial only if it improves upon the more standard statewide SRS of cellphone numbers. In this section, we evaluate the cost efficiency of the Rate Center Plus method compared with the SRS method. A more cost-efficient design requires a smaller sample size to achieve the study’s sample size goals (i.e., a more statistically efficient design), which translates into a reduction in data collection costs.

The key component of the Rate Center Plus method—estimating the classification errors—can be achieved in various ways. While the 2015 OMAS used prior survey data to estimate the classification errors, this may not always be possible (i.e., no prior survey data may exist). Therefore, an evaluation of different methods and how they compare with an SRS in terms of cost efficiency can help survey designers determine which approach is most appropriate for their study.

5.1 Evaluation Criteria

The main basis for comparison is how efficiently each approach achieves a set of predetermined substate targets. Cost efficiency is defined by the accuracy and precision with which each method achieves the substate targets (i.e., the more accurate the method, the lower the data collection costs to achieve the sample optimization objective). To determine the more cost-efficient method, the following statistics were compared based on the desired number of completed interviews in a county ($n_k$) and the final number of completed interviews in a county after applying the SRS or Rate Center Plus methods ($n_k^*$); a sketch of these computations follows the list.

1. Total distance between $n_k$ and $n_k^*$. The distance-from-target statistic compares the amount over and the amount under that a method achieves at the county-type level. In other words, $\sum_{k:\, n_k^* > n_k} (n_k^* - n_k)$ is the amount over and $\sum_{k:\, n_k^* < n_k} (n_k - n_k^*)$ is the amount under. This criterion is important from a cost-efficiency perspective because a large respondent amount over the target is cost inefficient, while a large respondent amount under the target may be detrimental to the statistical efficiency of the design.

2. Mean absolute difference between $n_k$ and $n_k^*$. The mean absolute difference measures the average absolute distance between the target and the actual number of completed interviews among counties in each county type. This criterion effectively demonstrates which method best minimizes the effect of classification error on the allocation.

3. Standard deviation of the mean absolute difference between $n_k$ and $n_k^*$. This statistic assesses the variation in the absolute differences across counties in the same county type. A large standard deviation indicates that, within a county type, the range of absolute differences is large; a smaller standard deviation indicates that most of the absolute differences for counties within a county type are near the average difference.
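A minimal sketch of the three criteria for one county type (Python with NumPy); the input arrays are hypothetical placeholders, not evaluation results.

```python
import numpy as np

# Hypothetical targets and achieved completes for counties of one type.
n_k = np.array([100., 50., 80., 140., 30.])     # targets
n_star = np.array([110., 44., 71., 150., 28.])  # achieved (illustrative)

diff = n_star - n_k
overage = diff[diff > 0].sum()     # completes above target (cost inefficiency)
shortfall = -diff[diff < 0].sum()  # completes below target (statistical risk)
mad = np.abs(diff).mean()          # mean absolute difference
sd = np.abs(diff).std(ddof=1)      # spread of the absolute differences

print(overage, shortfall, mad, sd)
```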
5.2 Classification Probabilities for the Rate Center Plus Method

As discussed in section 3, the first step in the Rate Center Plus method is to estimate the classification probabilities relating the actual county in which a cellphone respondent resides to the rate center county to which the respondent’s cellphone number was assigned. As described in section 4, the 2015 OMAS used data from a prior survey iteration to compute the classification errors. However, for one-time studies or studies being conducted for the first time, prior study data may not be available. Therefore, this evaluation considers prior survey data alongside three methods for computing the classification errors that do not require a prior survey iteration. The four approaches were the following:

Prior survey data. This method uses prior or existing survey data from the area of interest in which respondents report their county of residence. Using the cellphone respondent’s reported county and the rate center county associated with the cellphone number, classification errors can be computed. For this approach, the rate center county needs to be appended to the cellphone number during or shortly after data collection to ensure the telephone number is still associated with the person sampled.

Targeted address sample. This method uses a panel of cellphone-only respondents in an area of interest (e.g., Ohio) with known addresses. A sample of targeted address members is selected from each county within the state (e.g., we selected 150 people per county). Using the panel member’s cellphone number, the associated rate center county is appended, and the classification errors are computed accordingly. This method can be conducted before sample selection for a study to inform the sample allocation.

Billing ZIP code. This method uses the billing ZIP code associated with a cellphone number as the actual county for that number. Using rate center county, a test sample of cellphone numbers is selected in each county (e.g., we selected enough cellphone numbers to obtain 150 numbers with an appended billing ZIP code per rate center county). When available, the billing ZIP code is appended to each telephone number; in Ohio, about one-third of cellphone numbers have an available billing ZIP code. A sample of cellphone numbers with the corresponding billing ZIP code appended can be purchased to determine the classification probabilities prior to allocating the study sample.

Naïve rate center. This method treats rate center county as an error-free proxy for the actual county. As such, the exact sample size desired in a county is selected from its rate center county.

Of these approaches, only the prior survey data method requires past survey knowledge. The targeted address sample and billing ZIP code methods require a small test sample to develop the classification probabilities. The naïve rate center approach, like the SRS, can be used without any prior information or test sample. For approaches that do not use prior survey data, it is worth noting that eligibility and response rates at the county level will not be known. While out of scope for this paper, once an allocation is determined, response and eligibility assumptions need to be applied to determine the starting sample size of cellphone numbers, and these rates are likely to differ by county. Therefore, one advantage of having prior survey data is that these additional rates will be known at the county level. Furthermore, the targeted address method will identify people living in a county who have out-of-state cellphone numbers; the rate centers for these numbers are not within the state, so these cellphone numbers were excluded from the evaluation.
5.3 Evaluation

For each method, the Rate Center Plus method was implemented through six steps to estimate the expected number of completed interviews per county. The evaluation used cellphone users in Ohio as the sampling population; Ohio has eighty-eight counties, and stratification and sample targets were set at the county level. The six steps were the following (a sketch follows the list):

1. Calculate the classification probabilities ($p_{k'|k}$). For the SRS method, there are no classification probabilities because there is no stratification. For the naïve method, the classification probability is 1 when the actual county equals the rate center county (i.e., $p_{k'|k} = 1$ when $k' = k$).

2. Produce the resulting 88 × 88 probability matrix. For the SRS method, no matrix is required because there is no stratification within the state. For the naïve rate center method, the probability matrix is an identity matrix (i.e., a probability of 1 along the diagonal).

3. Set the desired number of cellphone interviews in each county ($n_k$) to the 2015 OMAS targeted number. The 2015 OMAS targeted number of interviews per county was based on a quasi-proportional allocation with a minimum floor of forty-five interviews in each county (Berzofsky et al. 2015).

4. Determine the number of completed interviews per rate center county ($n_{k'}$). For the prior survey data, targeted address, billing ZIP code, and naïve rate center methods, this was done as described in the fourth and fifth steps of the Rate Center Plus method. For the SRS method, the distribution was assumed to follow the distribution of 1,000-banks allocated to cellphone numbers in the state; each 1,000-bank is associated with a rate center county, providing an expected distribution by rate center county.

5. Compute the probability of a respondent being in an actual county given the rate center county of the respondent’s cellphone number, using the respondent set from the 2015 OMAS (i.e., $p_{k|k'}$). This probability was treated as the true respondent distribution across rate center counties.

6. Estimate the expected final distribution of completed interviews ($n_k^*$). For each rate center county, the number of completed interviews was multiplied by the true respondent distribution to produce the expected number of completed interviews in the actual county given the rate center county (i.e., $n_{k|k'} = n_{k'} \times p_{k|k'}$). The expected final number of completed interviews in an actual county is the sum over rate center counties (i.e., $n_k^* = \sum_{k'=1}^{K} n_{k|k'}$).
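The following sketch mirrors steps four through six, reusing the table 1 inputs. The estimated matrix is perturbed with random noise to stand in for an imperfect auxiliary source; with a perfect estimate (no noise), $n_k^*$ reproduces the targets exactly. The array names and the noise model are illustrative assumptions, not the paper's evaluation data.

```python
import numpy as np

rng = np.random.default_rng(0)

n_k = np.array([100., 50., 80., 140., 30.])  # county targets (table 1)
P_true = np.array([                           # "true" p_{k'|k}, rows = actual county
    [0.80, 0.05, 0.00, 0.01, 0.14],
    [0.35, 0.40, 0.10, 0.15, 0.00],
    [0.03, 0.15, 0.45, 0.30, 0.07],
    [0.00, 0.02, 0.02, 0.95, 0.01],
    [0.10, 0.05, 0.05, 0.60, 0.20],
])

# A noisy estimate stands in for an imperfect auxiliary source.
P_est = P_true * rng.uniform(0.9, 1.1, P_true.shape)
P_est /= P_est.sum(axis=1, keepdims=True)     # rows still sum to 1

# Step 4: allocate to rate center counties using the *estimated* matrix.
n_rc = (n_k[:, None] * P_est).sum(axis=0)     # n_{k'}

# Step 5: "true" respondent mix p_{k|k'} within each rate center county,
# derived from the true matrix via Bayes' rule (columns normalized).
joint_true = n_k[:, None] * P_true
p_actual_given_rc = joint_true / joint_true.sum(axis=0)

# Step 6: expected completes per actual county under the allocation.
n_star = (n_rc * p_actual_given_rc).sum(axis=1)
print(np.round(n_star, 1))  # deviates from n_k in proportion to the noise
```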
5.4 Evaluation Criteria

The evaluation of each method revolved around the relationship between the targeted number of completed interviews in each county ($n_k$) and the expected number of completed interviews under each classification method ($n_k^*$). The evaluation criteria, applied by county type, were the following: the overage summed across counties where $n_k^* > n_k$ and the shortfall summed across counties where $n_k^* < n_k$; the average absolute difference between $n_k$ and $n_k^*$; and the standard deviation of the average absolute difference.

5.5 Evaluation Results

Figure 7 presents, by county type, the total overage in counties where the expected number of respondents is greater than the target and the total shortfall in counties where the expected number of respondents is less than the target. From an efficiency standpoint, the total overage and shortfall can be more meaningful than the average discrepancy for a method because they better distinguish between statistical efficiency and cost efficiency. The overage amount represents the cost inefficiency in a design, and the shortfall amount represents the statistical inefficiency of the design in meeting its optimization objective. In total, across all four county types, the targeted address and prior survey data methods had smaller overages and shortfalls than the SRS method, while the naïve rate center and billing ZIP code methods had larger overages and shortfalls than the SRS method. The targeted address method had an overage of 2,378 interviews and a shortfall of 2,367 interviews, and the prior survey method had an overage of 3,319 interviews and a shortfall of 3,320 interviews. For both the overage and the shortfall, the targeted address method was roughly 2,000 interviews more efficient and the prior survey method roughly 1,000 interviews more efficient than the SRS method (total overage of 4,208 and total shortfall of 4,408). As seen in the figure, these overall findings, for the most part, hold within each of the four county types, indicating that the targeted address and prior survey methods are consistent in their efficiency gains across different types of geography compared with the SRS method.

Figure 7. Total Overage and Shortfall between Targeted and Expected Respondent Sample Size by Classification Method and County Type.

The Rate Center Plus method with probabilities computed using either prior data or a targeted sample reduced data collection costs compared with the SRS method. Given the desired 26,000 cellphone respondents, to achieve the substate targets in each county, the SRS method requires an increase in the number of cases screened (i.e., the overage plus the additional screening required to make up the shortfall) of 6.3% compared with the prior survey data method and of 12.8% compared with the targeted address sample method.

Figure 8 presents the comparison of the average absolute difference between the targeted respondent sample size in a county and the expected respondent sample size under each method. The prior survey and targeted address approaches are consistently more accurate than the SRS method, while the naïve and billing ZIP code approaches are less accurate. Furthermore, the SRS method had particularly large absolute differences in the metro counties. This may be because of the high false-positive rates in the metro counties (i.e., people not residing in the metro county even though the rate center county indicates they do), which are not accounted for under the SRS method.

Figure 8. Average Absolute Difference between Targeted and Expected Respondent Sample Size by Classification Method and County Type.

Table 3 presents the standard deviations of the average absolute differences by classification method and county type. The targeted address sample method has the smallest standard deviations across all county types (overall standard deviation of sixty-nine, compared with 291 for the SRS method). This indicates that the mean distance from the desired target is least variable under the targeted address method.
Because the mean distance from the target is both small and least variable under the targeted address sample method, this method is the most consistently close to the desired target. Furthermore, the prior survey data method has standard deviations smaller than or similar to those of the SRS method in all county types except the two rural types. The prior survey data method has the second smallest overall standard deviation, but at the county-type level, the prior survey method and the SRS method each have the second smallest standard deviation in two county types (metro and suburban for the prior survey method; rural Appalachian and rural non-Appalachian for the SRS method).

Table 3. Standard Deviation for the Average Absolute Difference by Classification Method and County Type

Method            | All counties | Metro | Rural Appalachian | Rural non-Appalachian | Suburban
SRS               | 291          | 743   | 44                | 37                    | 63
Naïve             | 277          | 499   | 255               | 142                   | 57
Prior survey data | 105          | 229   | 49                | 57                    | 46
Billing ZIP       | 233          | 493   | 47                | 42                    | 80
Targeted address  | 69           | 133   | 27                | 24                    | 44

5.6 Evaluation Conclusions

On the basis of the evaluation results, the Rate Center Plus method can be more efficient than the SRS method, but its efficiency depends on the source of the classification probabilities. The prior survey data and targeted address sample approaches are more efficient than the SRS method, while the naïve rate center and billing ZIP code methods are less efficient. The efficiency gains under the Rate Center Plus method translate into a 7% reduction in screening costs to achieve the substate targets when the prior survey data method is used to estimate the classification probabilities and a 15% reduction when the targeted address sample method is used.

6. CONCLUSIONS

This paper proposes the Rate Center Plus method, a sample allocation method for cellphone samples that require stratification below the state level, and shows how the necessary classification probabilities can be developed and used during the implementation of the method. The paper applies the Rate Center Plus method to the 2015 OMAS and assesses its performance. Finally, the paper evaluates five methods for determining the necessary classification probabilities. These methods offer options for survey methodologists with access to auxiliary data, including prior survey data, to help inform the production of the classification probabilities.
The Rate Center Plus method can be used with one-time surveys, periodic cross-sectional surveys, and continuous data collection surveys, such as the California Health Interview Survey. [3] For any survey type, the Rate Center Plus method requires auxiliary data to estimate the classification probabilities. Based on our review, the required information is readily available to survey designers at little or no additional cost to data collection. For periodic or continuous data collection surveys, if sample sizes allow, prior survey data can be used to implement the Rate Center Plus method without additional cost to the survey.

Our application to the 2015 OMAS showed that the method helps improve the efficiency of the sample allocation across rate center strata in obtaining the desired number of interviews in the actual strata. While the Rate Center Plus method does not yield an exact allocation to the desired geographic areas, with the right auxiliary information our method can closely approximate the desired allocation. In fact, our evaluation found that the Rate Center Plus method can achieve substate sample targets as much as 15% more efficiently than the SRS method. On the basis of our evaluation, prior survey data (based on a previous sample of 5,000 interviews) and a targeted address sample provide classification probabilities that yield allocations more accurate than those of the SRS method. Therefore, the Rate Center Plus method can be an effective way to efficiently allocate a cellphone sample to substate areas, but care needs to be taken to ensure that the source used for the classification errors is accurate.

While our application focused on the distribution of the nominal completed interviews (an important measure to many survey sponsors), the Rate Center Plus method has the potential to be used to minimize the design effect or other measures of precision across strata. Future research can measure the resulting design effect within a county given an allocation across rate center counties. This extension would incorporate the nominal sample size targets desired by survey sponsors while maximizing the effective sample size desired by survey methodologists.

Footnotes

[1] The desired number of completed interviews for each rate center should then be adjusted to account for nonresponse and ineligible cellphone numbers (e.g., nonworking numbers) to get the number of cellphone numbers that need to be selected in each rate center county.

[2] While twenty was used as a floor, most collapsed counties had more than forty respondents. In general, twenty respondents is probably not enough to generate precise estimates of the classification probabilities across eighty-eight counties.

[3] Beginning in 2015, a version of the Rate Center Plus method was implemented in the California Health Interview Survey, but it is not yet documented in the literature.

REFERENCES

Berzofsky, M. E., Lu, B., Weston, D., Couzens, G. L., and Sahr, T. (2015), “Considerations for the Use of Small Area Analysis in Survey Analysis for Health Policy: Example from the 2015 Ohio Medicaid Assessment Survey,” Proceedings of the 70th Annual American Association for Public Opinion Research Conference, pp. 3963–3976.

California Health Interview Survey (2014), CHIS 2011–2012 Methodology Series: Report 1—Sample Design, Los Angeles, CA: UCLA Center for Health Policy Research.

Calinescu, M., Bhulai, S., and Schouten, B. (2013), “Optimal Resource Allocation in Survey Designs,” European Journal of Operational Research, 226, 115–121.
Kafka, S. M., Chattopadhyay, M., and Chan, A. (2015), “Cellphone Sampling at the State Level: Geographic Accuracy and Coverage Concerns,” paper presented at the 70th Annual American Association for Public Opinion Research Conference, Hollywood, FL.

Levine, B., and Harter, R. (2015), “Optimal Allocation of Cell-Phone and Landline Respondents in Dual-Frame Surveys,” Public Opinion Quarterly, 79, 91–104.

Lu, B., Berzofsky, M. E., Sahr, T., Ferketich, A., Blanton, C. W., and Tumin, R. (2014), “Capturing Minority Populations in Telephone Surveys: Experiences from the Ohio Medicaid Assessment Survey Series,” poster presented at the 69th Annual American Association for Public Opinion Research Conference, Anaheim, CA.

National Health Interview Survey (2016), “Wireless Substitution: State-Level Estimates from the National Health Interview Survey, 2015,” National Health Interview Survey Early Release Program, available at https://www.cdc.gov/nchs/data/nhis/earlyrelease/wireless_state_201608.pdf.

Ohio Medicaid Assessment Survey (2012), 2012 Ohio Medicaid Assessment Survey: Sample Design and Methodology, available at https://osuwmcdigital.osu.edu/sitetool/sites/omaspublic/documents/2012_OMAS_SampleDesignMethodolgy_Final.pdf.

Peytchev, A., and Neely, B. (2013), “RDD Telephone Surveys: Toward a Single-Frame Cell-Phone Design,” Public Opinion Quarterly, 77, 283–304.

Skalland, B., and Khare, M. (2013), “Geographic Inaccuracy of Cell Phone Samples and the Effect on Telephone Survey Bias, Variance, and Cost,” Journal of Survey Statistics and Methodology, 1(1), 46–65.

© The Author 2017. Published by Oxford University Press on behalf of the American Association for Public Opinion Research. All rights reserved. For Permissions, please email: journals.permissions@oup.com

A Method for Accounting for Classification Error in a Stratified Cellphone Sample

Loading next page...
 
/lp/ou_press/a-method-for-accounting-for-classification-error-in-a-stratified-40s1dvzMeP
Publisher
Oxford University Press
Copyright
© The Author 2017. Published by Oxford University Press on behalf of the American Association for Public Opinion Research. All rights reserved. For Permissions, please email: journals.permissions@oup.com
ISSN
2325-0984
eISSN
2325-0992
D.O.I.
10.1093/jssam/smx033
Publisher site
See Article on Publisher Site

Abstract

The Rate Center Plus method uses auxiliary data to estimate and account for the classification error in the rate center county assignments so that the cellphone sample can be allocated properly. Through this process, we aim to show how the efficiency with which the desired allocation across substate areas is achieved can be improved. Ultimately, our paper has two goals: (1) develop an empirical methodology to improve the accuracy and efficiency of allocating a targeted sample across sampling strata, which serve as proxies for geographic areas, to achieve a desired number of interviews in each area, and (2) show under what conditions the developed empirical approach works. Given the constraints that must be met for such an allocation, an exact mathematical solution may not be possible without solutions for some areas falling only on the boundaries of the domain (e.g., negative values), which, in practical terms, leads to unacceptable allocations. Therefore, our approach, using relatively inexpensive auxiliary data, seeks a solution that minimizes the difference between the targeted allocation and the achieved allocation. To assess the conditions under which our method produces optimal results, we conduct two assessments. First, we use information obtained from the 2012 OMAS to estimate the classification error between rate center county and county of residence and apply it to the 2015 OMAS. Second, we evaluate five sources of auxiliary information, including sources that can be used during a survey's first administration, to determine which perform well and which yield inaccurate final respondent allocations.

3. RATE CENTER PLUS METHOD

3.1 Approach

The Rate Center Plus method uses Bayes' rule to achieve an accurate allocation of a sample across a set of smaller geographic units (e.g., counties within a state) when only a set of proxy geographic units (e.g., rate center counties) with unknown misclassification rates is available on the frame for stratification. The allocation relies on estimated classification probabilities between a cellphone number's assigned rate center and the county in which the owner of the cellphone actually resides. Several approaches can be used to estimate the classification probabilities, and each results in a different allocation of the desired number of completed interviews for a county across rate center counties. As such, the efficiency with which the county-level allocation is achieved varies across approaches. These approaches include using prior survey data, a small targeted address sample, billing ZIP code data, or the distribution of 1,000-banks, or assuming a proportional allocation. Section 5 presents an evaluation of these alternatives for computing the classification errors.

In general, the Rate Center Plus method can be completed in five basic steps (a brief illustrative sketch follows the list):

1. For each of the K counties, determine the misclassification rates associated with a cellphone number's rate center county (RCC) assignment given the actual county (AC) in which the respondent resides. Compute these rates as the conditional probability that a responding cellphone number is assigned to RCC k' given the AC k in which the cellphone respondent resides, i.e., $p_{k'|k} = P(\mathrm{RCC}_{k'} \mid \mathrm{AC}_k)$ for $k = 1, \ldots, K$ and $k' = 1, \ldots, K$. See section 5 for an evaluation of approaches for estimating this probability.

2. Using the misclassification rates developed in step one, create a $K \times K$ probability matrix.

3. Determine the desired number of completed interviews in each subarea (i.e., county), $n_k$. This distribution takes into account any unequal allocation of sampling units (e.g., oversampling in certain counties or areas) required to achieve study objectives.

4. Using Bayes' rule, apply the desired allocation from step three to the probability matrix from step two to obtain the expected number of respondents in actual county k whose cellphones are assigned to rate center county k', i.e., $n_{k|k'} = n_k \times p_{k'|k}$.

5. Obtain the number of expected respondents per rate center county by summing the expected conditional sample allocation from step four over actual counties (see footnote 1), i.e., $n_{k'} = \sum_{k=1}^{K} n_{k|k'}$ for $k' = 1, \ldots, K$.
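To make steps four and five concrete, the following minimal Python sketch (ours, not from the paper; the function name and data layout are assumptions for illustration) computes the rate center county allocation from a desired county allocation and an estimated classification-probability matrix:

```python
import numpy as np

def rate_center_plus(n_desired, P):
    """Steps 4-5 of the Rate Center Plus method (illustrative sketch).

    n_desired : length-K vector of desired interviews per actual county (n_k).
    P         : K x K matrix with P[k, j] = p_{k'|k}, the probability that a
                resident of actual county k has a cellphone number assigned
                to rate center county j. Each row must sum to 1.
    Returns (n_cond, n_rc): the K x K matrix of expected conditional counts
    n_{k|k'} and the length-K vector n_{k'} of interviews to target in each
    rate center county.
    """
    n_desired = np.asarray(n_desired, dtype=float)
    P = np.asarray(P, dtype=float)
    if not np.allclose(P.sum(axis=1), 1.0):
        raise ValueError("each row of the probability matrix must sum to 1")
    n_cond = n_desired[:, None] * P   # step 4: n_{k|k'} = n_k * p_{k'|k}
    n_rc = n_cond.sum(axis=0)         # step 5: n_{k'} = sum over k of n_{k|k'}
    return n_cond, n_rc
```

In practice (see footnote 1), the resulting per-rate-center targets would then be inflated for eligibility and response rates before drawing telephone numbers.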
3.2 Simple Example of the Rate Center Plus Method

To illustrate the proposed approach, we (1) look at the relationship between classification error rates and the matrix probabilities for a county and (2) present a simple example.

3.2.1 Different types of counties

To get a sense of how accurately the rate center county predicts a cellphone respondent's county of residence, we had the rate center county appended to the final set of 2012 OMAS respondents (for which a statewide SRS of cellphone numbers had been selected) and compared it with the reported county of residence. Figure 1 shows the distribution of respondents based on their assigned rate center county and their survey response by county and county type (metro, suburban, rural non-Appalachian, and rural Appalachian). In metro counties, the rate center county assignment overestimates the number of residents, but for the other county types, the rate center county underestimates the number of residents. Treating the reported county of residence as the truth, we find that (1) metro counties have a higher false-positive rate and (2) nonmetro counties have a higher false-negative rate.

Figure 1. 2012 Rate Center County Versus Actual County of Residence, by County Type.

To illustrate the impact that these differing classification error rates have on our proposed method, figures 2 and 3 show the distribution of the 2012 OMAS respondents from Hamilton County (a metro county) and Coshocton County (a rural county) in Ohio.

Figure 2. Distribution of Hamilton County Respondents by Rate Center County, 2012 OMAS.

Figure 3. Distribution of Coshocton County Respondents by Rate Center County, 2012 OMAS.

The rate center county for Hamilton County has a 4.0% false-positive rate and a 7.2% false-negative rate, whereas the rate center county for Coshocton County has a false-positive rate of only 0.1% and a false-negative rate of 32.5%. For Hamilton County, the false-negative rate, relatively small compared with Coshocton County's, means that very few cellphone numbers assigned to rate center counties other than Hamilton belong to Ohio residents who live inside Hamilton. Therefore, to obtain residents of Hamilton County, one needs to draw numbers largely from the Hamilton County rate center. The relatively large false-positive rate means that some cellphone numbers assigned to the Hamilton rate center belong to residents of other counties. In this example, respondents from the surrounding counties will come mostly from Hamilton. This finding is representative of the other metro counties in Ohio. Therefore, an oversample of telephone numbers needs to be selected from metro county rate centers because they will be the main supplier of interviews for both metro and nonmetro counties. Unlike Hamilton County, Coshocton County's relatively large false-negative rate means that a significant portion of the cellphone numbers belonging to Coshocton residents are assigned to rate center counties other than Coshocton. The relatively small false-positive rate means that cellphone numbers assigned to the Coshocton rate center almost always belong to Coshocton residents, so numbers selected in the Coshocton rate center are highly likely to belong to residents of Coshocton. These findings are representative of the other nonmetro counties. Therefore, an undersample of cellphone numbers needs to be drawn from nonmetro rate center counties because a large portion of nonmetro county interviews will come from other rate center counties.
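To formalize the rates just described (our formalization; the paper reports the rates without giving formulas), treat the reported county of residence as truth and let $\pi_j$ denote the share of the state's cellphone users residing in county $j$. Then, in the notation of section 3.1,

$$\mathrm{FN}_k = P(\mathrm{RCC} \neq k \mid \mathrm{AC} = k) = 1 - p_{k|k}, \qquad \mathrm{FP}_k = P(\mathrm{AC} \neq k \mid \mathrm{RCC} = k) = 1 - \frac{p_{k|k}\,\pi_k}{\sum_{j=1}^{K} p_{k|j}\,\pi_j},$$

where $p_{k|j} = P(\mathrm{RCC}_k \mid \mathrm{AC}_j)$. The false-positive rate involves the same Bayes' rule inversion that motivates the allocation method: Hamilton County's small false-negative and larger false-positive rates, and Coshocton County's reverse pattern, are the two extremes of this relationship.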
3.2.2 Simple example

We fully illustrate the Rate Center Plus method with a sample design for a hypothetical state with five counties and a target of four hundred completed interviews across those five counties. County 1 and County 4 are metro counties; County 2, County 3, and County 5 are nonmetro counties. Table 1 presents the allocation of the four hundred total interviews across the five counties ($n_k$), the conditional probability of a respondent's cellphone number being assigned to each rate center county given the actual county in which the respondent resides ($p_{k'|k}$), and the resulting sample size to be selected from each rate center ($n_{k'}$). These numbers correspond to steps one through five described in section 3.1.

Table 1. Desired Number of Interviews, Probability of Association with Rate Center County, and Rate Center County Sample Size for a State with Five Counties

| Actual county | Desired n | RC County 1 | RC County 2 | RC County 3 | RC County 4 | RC County 5 | Total |
| County 1 | 100 | 80% (80) | 5% (5) | 0% (0) | 1% (1) | 14% (14) | 100% |
| County 2 | 50 | 35% (17.5) | 40% (20) | 10% (5) | 15% (7.5) | 0% (0) | 100% |
| County 3 | 80 | 3% (2.4) | 15% (12) | 45% (36) | 30% (24) | 7% (5.6) | 100% |
| County 4 | 140 | 0% (0) | 2% (2.8) | 2% (2.8) | 95% (133) | 1% (1.4) | 100% |
| County 5 | 30 | 10% (3) | 5% (1.5) | 5% (1.5) | 60% (18) | 20% (6) | 100% |
| Sample size | 400 | 103 | 41 | 45 | 184 | 27 | |

Note: In each rate center county column, the percentage is $p_{k'|k}$ and the parenthetical value is the expected count $n_{k|k'} = n_k \times p_{k'|k}$; the bottom row gives the column sums $n_{k'}$.

For County 1, a metro county, table 1 shows that the design targets one hundred interviews. To obtain those interviews, in expectation, $n_{k|k'}$ is eighty for the County 1 rate center ($p_{1|1} = 80\%$), five for the County 2 rate center, zero for the County 3 rate center, one for the County 4 rate center, and fourteen for the County 5 rate center. However, because County 1 is a metro county, numbers assigned to its rate center will also, in expectation, supply $n_{2|1} = 17.5$ interviews to County 2, $n_{3|1} = 2.4$ interviews to County 3, $n_{4|1} = 0$ interviews to County 4, and $n_{5|1} = 3$ interviews to County 5. Therefore, the sample allocated to the County 1 rate center must be large enough to obtain 103 interviews. By contrast, County 3, a nonmetro county, has a desired $n_3 = 80$ completed interviews but has only $n_{3|3} = 36$ interviews allocated to the County 3 rate center, because many of its interviews will come from other rate center counties, including $n_{3|2} = 12$ from the County 2 rate center and $n_{3|4} = 24$ from the County 4 rate center.
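Running the table 1 inputs through steps four and five reproduces the rate center sample sizes (a self-contained snippet; the numbers are from table 1, the code is our illustration):

```python
import numpy as np

n = np.array([100, 50, 80, 140, 30], dtype=float)   # desired interviews n_k
P = np.array([[0.80, 0.05, 0.00, 0.01, 0.14],       # p_{k'|k}: rows = actual county,
              [0.35, 0.40, 0.10, 0.15, 0.00],       # columns = rate center county
              [0.03, 0.15, 0.45, 0.30, 0.07],
              [0.00, 0.02, 0.02, 0.95, 0.01],
              [0.10, 0.05, 0.05, 0.60, 0.20]])
n_cond = n[:, None] * P        # n_{k|k'}: the per-cell counts shown in table 1
n_rc = n_cond.sum(axis=0)      # n_{k'}: interviews per rate center county
print(n_rc)  # approx. [102.9, 41.3, 45.3, 183.5, 27.0]; table 1 shows these rounded
```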
4. APPLICATION TO 2015 OHIO MEDICAID ASSESSMENT SURVEY

In applying the process to the 2015 OMAS, we had two main goals: (1) apply the proposed allocation approach to understand any issues or complications with the process and (2) assess how well the process worked, in terms of both its key assumptions and how well the final distribution of respondents matched the desired distribution.

4.1 Ohio Medicaid Assessment Survey

The OMAS is a periodic survey of residents of Ohio that measures health insurance coverage and access to medical services among adults and children. Because the outcomes of interest are highly correlated with where a person lives, county-level estimates are critical to understanding the populations at greatest risk. Since its inception in 2004, OMAS has used an RDD telephone design. Beginning in 2008, OMAS moved to a dual-frame design, but with a limited cellphone sample (approximately 5% of the total sample). In 2012, OMAS increased the proportion of desired interviews from the cellphone sample to 25.0% (Ohio Medicaid Assessment Survey 2012); the 2012 cellphone sample was a statewide random sample of cellphone numbers. At the time of the 2015 survey, according to the National Health Interview Survey (2016), 65.0% of adults and 75.2% of children lived in a cellphone-only or cellphone-mostly household. Moreover, data from the 2012 OMAS indicated that minorities, households with children, and residents at or around the poverty level were more likely to be contacted through the cellphone frame (Lu et al. 2014). On the basis of these findings, the cellphone allocation was increased to 55.0% of completed interviews. In total, 40,000 interviews (cellphone and landline combined) were targeted for the 2015 OMAS, with the objective of producing direct estimates at the county level in the majority of Ohio's eighty-eight counties.

4.2 Sample Allocation

Using 2012 OMAS data, we implemented our proposed approach. For step one, we calculated $p_{k'|k}$ using the approximately 5,000 completed cellphone interviews conducted during the 2012 survey, which used an SRS to select cellphone numbers across the state. After data collection, the rate center county associated with each responding telephone number was obtained to compute the conditional probabilities.
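This step-one computation is essentially a normalized cross-tabulation of reported county against appended rate center county. A minimal sketch (ours, not the paper's code; the pandas column names are hypothetical):

```python
import pandas as pd

def classification_matrix(resp: pd.DataFrame) -> pd.DataFrame:
    """Estimate p_{k'|k} from a prior survey's cellphone respondents.

    resp needs one row per respondent with columns 'actual_county'
    (reported county of residence) and 'rcc' (appended rate center county).
    Rows of the result index the actual county, columns the rate center
    county, and each row is normalized to sum to 1.
    """
    counts = pd.crosstab(resp["actual_county"], resp["rcc"])
    return counts.div(counts.sum(axis=1), axis=0)
```

In practice, counties with too few respondents would first be collapsed with similar neighboring counties before normalizing, as described next.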
It is worth noting that three counties in Ohio do not have any rate centers, which means that all of the desired interviews from those counties must come from other counties (i.e., all of their interviews depend on the false-negative rates of other rate center counties). For step two, we created a 90 × 90 matrix. Ohio has eighty-eight counties, but we split two mostly urban counties, Cuyahoga County (where Cleveland is located) and Franklin County (where Columbus is located), in two based on rate centers identified as having a higher density of African-Americans, in order to oversample this population. One issue we encountered was that some counties had too few cellphone respondents in 2012 (fewer than twenty) to obtain reliable classification errors and distributions of respondents across rate center counties. When this occurred, we collapsed counties with neighboring counties of the same county type (e.g., metro, suburban, rural Appalachian, or rural non-Appalachian) to develop combined probabilities. For the probability matrix, twenty-six counties required a combined probability; the combined probabilities were assigned to the counties with fewer than twenty cellphone respondents in 2012 (see footnote 2). For step three, we initially allocated the desired 26,000 cellphone respondents proportionally across the ninety strata (i.e., set $n_k$). Adjustments were made to the allocation to ensure a minimum of forty-five completed interviews in each stratum, and the desired sample sizes of the two high-density African-American strata were increased to account for the oversample. See Berzofsky, Lu, Weston, Couzens, and Sahr (2015) for more details on the sample allocation. For step four, given the allocation of the 26,000 desired cellphone respondents across the ninety strata from step three and the probability matrix from step two, we determined the desired number of interviews for each county stratum by rate center stratum combination (i.e., $n_{k|k'}$). For step five, within each rate center stratum, we summed across all ninety county strata to obtain the desired number of interviews in each rate center stratum (i.e., $n_{k'}$).

Figure 4 presents the final allocation across rate center strata compared with the desired number of interviews in each county stratum. Strata with a dot on the forty-five degree line have exactly the same desired number of interviews in the rate center stratum as in the county stratum. As expected, metro strata, for the most part, will obtain more interviews in the rate center stratum than in the actual county. For example, Butler County has almost 2,500 interviews drawn from its rate center county but desires only about 1,250 interviews from residents of that county. In total, thirty-eight strata have a larger desired sample in the rate center stratum than in their equivalent county stratum, and fifty strata have a smaller one.

Figure 4. Sample Size Drawn Versus Target Sample Desired by County Type for OMAS 2015.
4.3 Assessing the Performance of the Approach

The assessment of how well the Rate Center Plus method performed for the 2015 OMAS sample consisted of two components: (1) assessing the assumptions of the method and (2) comparing the final distribution of respondents with the desired distribution under the proposed approach.

4.3.1 Method assumptions

Methods such as the Rate Center Plus method usually require a set of assumptions to be valid. To assess the efficacy of using prior survey data to develop the classification probabilities, two assumptions needed to be met:

Assumption 1: The distribution of cellphone users by county (or substate area) in the current survey period is similar to the distribution of cellphone users in the period used to develop the classification probability matrix.

Assumption 2: The classification error rates for each rate center county (or substate area) during the current survey period are similar to the classification error rates in the period used to produce the classification probability matrix.

Assumption 1 helps ensure that the distribution of where a county's respondents will be found across rate center counties is similar. If the distribution of cellphone users has changed between when the classification probabilities were derived and when the survey was conducted, the distribution of where residents live according to rate center county has likely changed as well. Assumption 2 helps ensure that the classification probability matrix used for the allocation across rate center counties will produce the desired number of interviews in each county. For example, if a rate center county has a high false-positive rate in the initial period but not in the current period, too many interviews will be allocated to that rate center county (i.e., the actual county will obtain more interviews than desired while other counties obtain fewer). Similarly, if the false-negative rate for a county has changed, the allocation assumption about how many of a county's interviews will be obtained from other rate center counties will not hold, leading to fewer interviews than desired in the county. For surveys with two or more prior iterations, this assumption can be examined before the current study period. However, because Assumption 2 is based on the respondent pool, changes in eligibility criteria or nonresponse patterns may affect it over time.

To assess Assumption 1, we compared the final distribution of cellphone users by county (i.e., the distribution of cellphone users weighted by their final survey weight to adjust for unequal probabilities of selection) in 2012 and 2015. Using respondent data from the 2012 and 2015 OMAS, figure 5 presents the estimated distribution of the cellphone population in Ohio by county in 2012 and 2015. As the figure shows, most of the eighty-eight counties align closely across the two periods. This indicates that Assumption 1 held and that the probability matrix used for the allocation remained valid for the 2015 survey.

Figure 5. Estimated Distribution of Cellphone Users in Ohio in 2012 and 2015 by County.

To assess Assumption 2, we compared the false-positive and false-negative rates in each county type between 2012 and 2015. Table 2 presents the classification error rates by county type during the two periods. As the table shows, the classification error rates in each county type were reasonably consistent over time. This indicates that Assumption 2 holds and that the number of cellphone numbers allocated to a rate center county will appropriately produce the desired number of interviews in each actual county. We also compared the classification error rates at the county level and found that, while there is variability across county types, the rates were similar across the two time periods.

Table 2. False-Positive and False-Negative Rates of Rate Center Accuracy in 2012 and 2015 by County Type

| County type | False-positive rate, 2012 | False-positive rate, 2015 | False-negative rate, 2012 | False-negative rate, 2015 |
| Metro | 2.7% | 2.1% | 17.0% | 21.8% |
| Rural Appalachian | 0.1% | 0.2% | 38.6% | 41.4% |
| Rural non-Appalachian | 0.2% | 0.2% | 44.2% | 47.2% |
| Suburban | 0.3% | 0.4% | 58.8% | 60.1% |
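This Assumption 2 check can be scripted directly from respondent-level data. A rough sketch (our code, not the paper's; column names are hypothetical and survey weights are ignored for simplicity), using the conditional definitions of the error rates from section 3.2.1:

```python
def error_rates(resp, county):
    """False-positive and false-negative rates for one county.

    resp: a pandas DataFrame with one row per cellphone respondent and
    columns 'actual_county' (reported residence, treated as truth) and
    'rcc' (appended rate center county).
    """
    assigned = resp[resp["rcc"] == county]             # numbers assigned here
    residents = resp[resp["actual_county"] == county]  # people who live here
    fp = (assigned["actual_county"] != county).mean()  # assigned here, live elsewhere
    fn = (residents["rcc"] != county).mean()           # live here, assigned elsewhere
    return fp, fn
```

Comparing these rates between the 2012 and 2015 respondent files, county type by county type, is the comparison summarized in table 2.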
When using prior survey data, an additional key criterion is that enough prior survey data are available in each county. As noted in our application, we collapsed counties that had fewer than twenty respondents in the 2012 OMAS. To ensure that collapsing counties for purposes of estimating the classification probabilities does not diminish their accuracy, one can examine the characteristics of persons in the counties being collapsed. In our case, we determined that counties within geographic proximity of each other and of the same type (e.g., metro, suburban, rural) had similar classification probabilities. If counties cannot be collapsed, an approach for estimating classification probabilities that does not require prior survey data may need to be considered (see section 5 for a discussion and evaluation of these approaches).

4.3.2 Final distribution of completed interviews

Figure 6 shows the actual number of completed interviews in each stratum compared with the desired number of completed interviews. The distribution tracks fairly closely with the forty-five degree line; in 55.0% of strata, we achieved or exceeded the desired number of interviews. As the breakout of strata with a target of fewer than 600 interviews in figure 6 shows, forty counties fell short of their target, but most of those counties were still very close to their desired goal: twenty were between 80% and 99% of their target, while only seven received more than twice or less than half of their target.

Figure 6. Actual Number of Completed Interviews Versus Desired Number of Completed Interviews for All Strata and Strata with Expected Number of 600 or Fewer in OMAS 2015.
5. EVALUATION OF THE EFFICIENCY OF RATE CENTER PLUS METHOD

The Rate Center Plus method aims to improve a survey designer's ability to target and allocate samples to geographic areas below the state level. As such, the method is only beneficial if it improves upon the more standard statewide SRS of cellphone numbers. In this section, we evaluate the cost efficiency of the Rate Center Plus method compared with the SRS method. A more cost-efficient design requires a smaller sample size to achieve the study sample size goals (i.e., the statistically efficient design), which translates into a reduction in data collection costs. The key component of the Rate Center Plus method, estimating the classification errors, can be achieved in various ways. While the 2015 OMAS used prior survey data to estimate the classification errors, this method may not always be possible (i.e., no prior survey data may exist). Therefore, an evaluation of the different methods and how they compare with an SRS in terms of cost efficiency can help survey designers determine which approach is most appropriate for their study.

5.1 Evaluation Criteria

The main basis for comparison is how efficiently each approach achieves a set of predetermined substate targets. Cost efficiency is defined by the accuracy and precision with which each method achieves the substate targets (i.e., the more accurate the method, the lower the data collection costs required to achieve the sample optimization objective). To determine the more cost-efficient method, the following statistics were compared based on the desired number of completed interviews in a county ($n_k$) and the final number of completed interviews in a county after applying the SRS or Rate Center Plus methods ($n_k^*$); a code sketch of these statistics follows the list:

- Total distance between $n_k$ and $n_k^*$. The distance-from-target statistics compare the amount over and the amount under that a method achieves at the county-type level. In other words, $\sum_{k:\, n_k^* > n_k} (n_k^* - n_k)$ is the amount over and $\sum_{k:\, n_k^* < n_k} (n_k^* - n_k)$ is the amount under. This criterion is important from a cost-efficiency perspective because a large amount over the target is cost inefficient, while a large amount under the target may be detrimental to the statistical efficiency of the design.

- Mean absolute difference between $n_k$ and $n_k^*$. The mean absolute difference measures the average absolute distance between the target and the actual number of completed interviews among counties in each county type. This criterion demonstrates which method best minimizes the effect of classification error on the allocation.

- Standard deviation of the mean absolute difference between $n_k$ and $n_k^*$. The standard deviation assesses the variation of the absolute differences across counties in the same county type. A large standard deviation indicates that, within a county type, the range of the absolute differences is large; a smaller standard deviation indicates that most of the absolute differences for counties within a county type are near the average difference.
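A compact sketch of the three comparison statistics (ours, for illustration; n and n_star are per-county arrays for the counties of one county type):

```python
import numpy as np

def comparison_stats(n, n_star):
    """Overage, shortfall, mean absolute difference, and its SD.

    n      : targeted interviews per county (n_k).
    n_star : final or expected interviews per county (n_k*).
    """
    n = np.asarray(n, dtype=float)
    n_star = np.asarray(n_star, dtype=float)
    diff = n_star - n
    overage = diff[diff > 0].sum()        # total amount over target
    shortfall = -diff[diff < 0].sum()     # total amount under target
    abs_diff = np.abs(diff)
    return overage, shortfall, abs_diff.mean(), abs_diff.std(ddof=1)
```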
5.2 Classification Probabilities for Rate Center Plus Method

As discussed in section 3, the first step in the Rate Center Plus method is to estimate the classification probabilities between the actual county in which a cellphone respondent resides and the rate center county to which the respondent's cellphone number was assigned. As described in section 4, for the 2015 OMAS we used data from a prior survey iteration to compute the classification errors. However, for one-time studies or studies being conducted for the first time, prior study data may not be available. Therefore, for this evaluation, we compare prior survey data with three additional methods for computing the classification errors that do not require a prior survey iteration. The four classification approaches are the following:

- Prior survey data. This method uses prior or existing survey data from the area of interest in which respondents reported their county of residence. Using the cellphone respondent's reported county and the rate center county associated with the cellphone number, the classification errors can be computed. For this approach, the rate center county needs to be appended to the cellphone number during or shortly after data collection to ensure the telephone number is still associated with the person sampled.

- Targeted address sample. This method uses a panel of cellphone-only respondents in the area of interest (e.g., Ohio) with known addresses. A sample of panel members is selected from each county within the state (e.g., we selected 150 people per county). Using each panel member's cellphone number, the associated rate center county is appended, and the classification errors are computed accordingly. This method can be conducted before sample selection to inform the sample allocation.

- Billing ZIP code. This method uses the billing ZIP code associated with a cellphone number as the actual county for that number. Using the rate center county, a test sample of cellphone numbers is selected in each county (e.g., we selected enough cellphone numbers to obtain 150 numbers with an appended billing ZIP code per rate center county). When available, the billing ZIP code is appended to each telephone number; in Ohio, about one-third of cellphone numbers have an available billing ZIP code. A sample of cellphone numbers with billing ZIP codes appended can be purchased to determine the classification probabilities before allocating the study sample.

- Naïve rate center. This method treats the rate center county as an error-free proxy for the actual county. As such, the exact sample size desired in a county is selected from its rate center.

Of these approaches, only the prior survey data method requires past survey knowledge. The targeted address sample and billing ZIP code methods require a small test sample to develop the classification probabilities. The naïve rate center approach, like the SRS, can be used without any prior information or test sample. For approaches that do not use prior survey data, it is worth noting that eligibility and response rates at the county level will not be known. While out of scope for this paper, once an allocation is determined, response and eligibility assumptions must be applied to determine the starting sample size of cellphone numbers, and these rates are likely to differ by county. Therefore, one advantage of prior survey data is that these additional rates are known at the county level. Furthermore, the targeted address method will identify people living in a county with out-of-state cellphone numbers; the rate centers for these numbers are not within the state, so these cellphone numbers were excluded from the evaluation.

5.3 Evaluation

For each method, the Rate Center Plus method was implemented through six steps to estimate the expected number of completed interviews per county.
The evaluation used cellphone users in Ohio as the sampling population. Ohio has eighty-eight counties, and stratification and sample targets were set at the county level. The six steps were the following (a code sketch of the core computation follows section 5.4):

1. Calculate the classification probabilities $p_{k'|k}$. For the SRS method, there is no classification probability because there is no stratification. For the naïve method, $p_{k'|k} = 1$ when $k' = k$ (and 0 otherwise).

2. Produce the resulting 88 × 88 probability matrix. For the SRS method, no matrix is required because there is no stratification within the state. For the naïve rate center method, the probability matrix is the identity matrix (i.e., a probability of 1 along the diagonal).

3. Set the desired number of cellphone interviews in each county ($n_k$) to the 2015 OMAS target. The 2015 OMAS targets were based on a quasi-proportional allocation with a minimum floor of forty-five interviews in each county (Berzofsky et al. 2015).

4. Determine the number of completed interviews per rate center county ($n_{k'}$). For the prior survey data, targeted address, billing ZIP code, and naïve rate center methods, this was done as described in steps four and five of the Rate Center Plus method. For the SRS method, the distribution was assumed to follow the distribution of the 1,000-banks of cellphone numbers in the state; each 1,000-bank is associated with a rate center county, which provides an expected distribution by rate center county.

5. Compute the probability of a respondent being in an actual county given the rate center county of the respondent's cellphone number, $p_{k|k'}$, using the respondent set from the 2015 OMAS. This probability was treated as the true respondent distribution across rate center counties.

6. Estimate the expected final distribution of completed interviews ($n_k^*$). For each rate center county, the number of completed interviews was multiplied by the true respondent distribution to produce the expected number of completed interviews in the actual county given the rate center county, $n_{k|k'} = n_{k'} \times p_{k|k'}$; the expected final number of completed interviews in an actual county is the sum over rate center counties, $n_k^* = \sum_{k'=1}^{K} n_{k|k'}$.

5.4 Evaluation Criteria

The evaluation of each method centered on the relationship between the targeted number of completed interviews in each county ($n_k$) and the expected number of completed interviews under each classification method ($n_k^*$). The evaluation criteria were the following:

- The overage summed over counties where $n_k^* > n_k$ and the shortfall summed over counties where $n_k^* < n_k$, by county type.

- The average absolute difference between $n_k$ and $n_k^*$, by county type.

- The standard deviation of the average absolute difference, by county type.
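Once the inputs are in hand, the six evaluation steps reduce to two matrix operations. A sketch (ours, illustrative; P_method holds a method's estimated $p_{k'|k}$ matrix and P_true_post the observed $p_{k|k'}$ matrix from the 2015 respondents):

```python
import numpy as np

def expected_final_distribution(n_target, P_method, P_true_post):
    """Expected interviews per actual county under one classification method.

    n_target    : length-K targets per county (n_k).
    P_method    : K x K estimated p_{k'|k} (rows: actual county; rows sum to 1).
    P_true_post : K x K observed p_{k|k'} (columns: rate center county; each
                  column sums to 1 over actual counties).
    """
    n_target = np.asarray(n_target, dtype=float)
    # Steps 4-5 of the method: interviews targeted per rate center county.
    n_rc = (n_target[:, None] * P_method).sum(axis=0)      # n_{k'}
    # Step 6 of the evaluation: redistribute by the true distribution.
    n_star = (P_true_post * n_rc[None, :]).sum(axis=1)     # n_k*
    return n_star
```

The result, n_star, is then compared with n_target using the criteria above.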
5.5 Evaluation Results

Figure 7 presents, by county type, the total overage in counties where the expected number of respondents is greater than the target and the total shortfall in counties where it is less than the target. From an efficiency standpoint, the total overage and shortfall can be more meaningful than the average discrepancy for a method because they better distinguish between statistical efficiency and cost efficiency: the overage represents the cost inefficiency in a design, while the shortfall represents the statistical inefficiency of the design in meeting its optimization objective. In total, across all four county types, the targeted address and prior survey data methods had smaller overages and shortfalls than the SRS method, while the naïve rate center and billing ZIP code methods had larger overages and shortfalls than the SRS method. Across all four county types, the targeted address method had an overage of 2,378 interviews and a shortfall of 2,367 interviews, and the prior survey method had an overage of 3,319 interviews and a shortfall of 3,320 interviews. For both the overage and the shortfall, the targeted address method was approximately 2,000 interviews more efficient, and the prior survey method approximately 1,000 interviews more efficient, than the SRS method (total overage of 4,208 and total shortfall of 4,408). As seen in the figure, the overall findings, for the most part, hold within each of the four county types, indicating that the targeted address and prior survey methods are consistent in their efficiency gains across different types of geography compared with the SRS method.

Figure 7. Total Overage and Shortfall between Targeted and Expected Respondent Sample Size by Classification Method and County Type.

The Rate Center Plus method with probabilities computed using either prior data or a targeted sample reduced data collection costs compared with the SRS method. Given the desired 26,000 cellphone respondents, to achieve the substate targets in each county, the SRS method requires an increase in the number of cases screened (i.e., the overage plus the additional screening required to make up the shortfall) of 6.3% compared with the prior survey data method and of 12.8% compared with the targeted address sample method.
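As a rough check on these cost figures, one can treat the target plus a method's overage and shortfall as a crude proxy for the number of cases screened (our simplification, not the paper's cost accounting, which is more detailed):

```python
# Overage + shortfall totals reported above, by method.
extra = {"SRS": 4208 + 4408,
         "prior survey": 3319 + 3320,
         "targeted address": 2378 + 2367}
target = 26000  # desired cellphone respondents

screened = {m: target + e for m, e in extra.items()}  # crude screening proxy
for m in ("prior survey", "targeted address"):
    pct = screened["SRS"] / screened[m] - 1
    print(f"SRS screens {pct:.1%} more cases than the {m} method")
# Prints roughly 6% and 13%, in line with the reported 6.3% and 12.8%.
```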
Figure 8 compares the average absolute difference between the targeted respondent sample size in a county and the expected respondent sample size under each method. The prior survey and targeted address approaches are consistently more accurate than the SRS method, while the naïve and billing ZIP code approaches are less accurate. Furthermore, the SRS method had particularly large absolute differences in the metro counties. This may be because of the high false-positive rates in the metro counties (i.e., people who do not reside in the metro county even though the rate center county indicates they do), which are not well identified under the SRS method.

Figure 8. Average Absolute Difference between Targeted and Expected Respondent Sample Size by Classification Method and County Type.

Table 3 presents the standard deviations, by county type, of the average absolute differences for each classification method. The targeted address sample method has the smallest standard deviations across all county types (an overall standard deviation of sixty-nine, compared with 291 for the SRS method). This indicates that the distance from the desired target is least variable under the targeted address method; that is, this method is the most consistently close to the desired target. Furthermore, the prior survey approach has standard deviations smaller than or similar to the SRS method's in all county types except the rural county types. The prior survey data method has the second smallest overall standard deviation, but at the county-type level, the prior survey method and the SRS method are each second smallest in two county types (metro and suburban for the prior survey method; rural Appalachian and rural non-Appalachian for the SRS method).

Table 3. Standard Deviation for the Average Absolute Difference by Classification Method and County Type

| Method | All counties | Metro | Rural Appalachian | Rural non-Appalachian | Suburban |
| SRS | 291 | 743 | 44 | 37 | 63 |
| Naïve | 277 | 499 | 255 | 142 | 57 |
| Prior survey data | 105 | 229 | 49 | 57 | 46 |
| Billing ZIP | 233 | 493 | 47 | 42 | 80 |
| Targeted address | 69 | 133 | 27 | 24 | 44 |

5.6 Evaluation Conclusions

On the basis of the evaluation results, the Rate Center Plus method can be more efficient than the SRS method, but this depends on the source of the classification probabilities. The prior survey data and targeted address sample approaches are more efficient than the SRS method, while the naïve rate center and billing ZIP code methods are less efficient. The efficiency gains under the Rate Center Plus method translate into a 7% reduction in screening costs to achieve the substate targets when the prior survey data method is used to estimate the classification probabilities and a 15% reduction when the targeted address sample method is used.

6. CONCLUSIONS

This paper proposes the Rate Center Plus method, a sample allocation method for cellphone samples that require stratification below the state level, and shows how the classification probabilities can be developed and used during the implementation of the method. The paper applies the Rate Center Plus method to the 2015 OMAS and assesses its performance. Finally, the paper evaluates five methods for determining the necessary classification probabilities. These methods offer options for survey methodologists with access to auxiliary data, including prior survey data, to help inform the production of the classification probabilities.
The Rate Center Plus method can be used with one-time surveys, periodic cross-sectional surveys, and surveys with continuous data collection, such as the California Health Interview Survey (see footnote 3). For any survey type, the Rate Center Plus method requires auxiliary data to estimate the classification probabilities. Based on our review, the required information is readily available to survey designers at little or no additional cost to data collection. For periodic or continuous surveys, if sample sizes allow, prior survey data can be used to implement the Rate Center Plus method without additional cost to the survey.

Our application to the 2015 OMAS showed that the method helps improve the efficiency of the sample allocation across rate center strata in obtaining the desired number of interviews in the actual strata. While the Rate Center Plus method does not yield an exact allocation to the desired geographic areas, with the right auxiliary information it can closely approximate the desired allocation. In fact, our evaluation found that the Rate Center Plus method can achieve substate sample targets more efficiently than the SRS method, by as much as 15%. On the basis of our evaluation, prior survey data (here based on a previous sample of approximately 5,000 interviews) and a targeted address sample provide classification probabilities that make the method more accurate than the SRS method. Therefore, the Rate Center Plus method can be an effective way to allocate a cellphone sample efficiently to substate areas, but care must be taken to ensure that the source used for the classification errors is accurate.

While our application focused on the distribution of the nominal completed interviews (an important measure to many survey sponsors), the Rate Center Plus method also has the potential to minimize the design effect or other measures of precision across strata. Future research can measure the resulting design effect within a county given an allocation across rate center counties. This extension would incorporate the nominal sample size targets desired by survey sponsors while maximizing the effective sample size desired by survey methodologists.

Footnotes

1. The desired number of completed interviews for each rate center should then be adjusted to account for nonresponse and ineligible cellphone numbers (e.g., nonworking numbers) to obtain the number of cellphone numbers that need to be selected in each rate center county.

2. While twenty was used as a floor, most collapsed counties had more than forty respondents. In general, twenty respondents is probably not enough to generate precise estimates of the classification probabilities across eighty-eight counties.

3. Beginning in 2015, a version of the Rate Center Plus method was implemented in the California Health Interview Survey, but it is not yet documented in the literature.

REFERENCES

Berzofsky, M. E., Lu, B., Weston, D., Couzens, G. L., and Sahr, T. (2015), "Considerations for the Use of Small Area Analysis in Survey Analysis for Health Policy: Example from the 2015 Ohio Medicaid Assessment Survey," Proceedings of the 70th Annual American Association for Public Opinion Research Conference, pp. 3963–3976.

California Health Interview Survey (2014), CHIS 2011–2012 Methodology Series: Report 1—Sample Design, Los Angeles, CA: UCLA Center for Health Policy Research.

Calinescu, M., Bhulai, S., and Schouten, B. (2013), "Optimal Resource Allocation in Survey Designs," European Journal of Operational Research, 226, 115–121.
Kafka, S. M., Chattopadhyay, M., and Chan, A. (2015), "Cellphone Sampling at the State Level: Geographic Accuracy and Coverage Concerns," paper presented at the 70th Annual American Association for Public Opinion Research Conference, Hollywood, FL.

Levine, B., and Harter, R. (2015), "Optimal Allocation of Cell-Phone and Landline Respondents in Dual-Frame Surveys," Public Opinion Quarterly, 79, 91–104.

Lu, B., Berzofsky, M. E., Sahr, T., Ferketich, A., Blanton, C. W., and Tumin, R. (2014), "Capturing Minority Populations in Telephone Surveys: Experiences from the Ohio Medicaid Assessment Survey Series," poster presented at the 69th Annual American Association for Public Opinion Research Conference, Anaheim, CA.

National Health Interview Survey (2016), "Wireless Substitution: State-Level Estimates from the National Health Interview Survey, 2015," National Health Interview Survey Early Release Program, available at https://www.cdc.gov/nchs/data/nhis/earlyrelease/wireless_state_201608.pdf.

Ohio Medicaid Assessment Survey (2012), 2012 Ohio Medicaid Assessment Survey: Sample Design and Methodology, available at https://osuwmcdigital.osu.edu/sitetool/sites/omaspublic/documents/2012_OMAS_SampleDesignMethodolgy_Final.pdf.

Peytchev, A., and Neely, B. (2013), "RDD Telephone Surveys: Toward a Single-Frame Cell-Phone Design," Public Opinion Quarterly, 77, 283–304.

Skalland, B., and Khare, M. (2013), "Geographic Inaccuracy of Cell Phone Samples and the Effect on Telephone Survey Bias, Variance, and Cost," Journal of Survey Statistics and Methodology, 1, 46–65.
