Discussion of “Probability Sampling by Connecting Space with Households Using GIS/GPS Technologies” by Chen, X.; Xu, X.; Gong, J.; Yan, Y.; Fang, L.

Abstract

Many innovative, new sampling plans that have been proposed for use in “resource limited” settings use GIS, GPS and Google Earth technology. In this issue of JSSAM, Chen et al. describe a GIS/GPS-assisted area probability sampling method and its application to an epidemiological study designed to investigate the relationship between social capital and HIV risk behaviors among rural immigrants to the city of Wuhan, China. The following discussion examines the strengths and weaknesses of the Chen et al. sampling plan and related applications of GIS/GPS/Google Earth technology to area probability sampling of household target populations. The aim is to stimulate additional methodological research on GIS/GPS-assisted area probability sample designs, focused on the real constraints that exist in resource limited settings but with an eye to opportunities that the GIS/GPS and Google Earth technology provide to construct designs that are optimal under those constraints.

INTRODUCTION

Despite debate over the need for external validity in some forms of epidemiological research (Keiding and Louis 2016), there is no question that probability sample surveys are important tools in studies of disease prevalence and general health-related characteristics of target populations. In the United States, the National Health Interview Survey (NHIS) and the National Health and Nutrition Examination Survey (NHANES) provide critical longitudinal data series on the health of US adults and children. Similar national health survey programs in countries around the globe are fielded independently by national health ministries or as part of multinational programs of coordinated health surveys (Kessler, Haro, Heeringa, Pennell, and Bedirhan 2006; Moussavi, Chatterji, Verdes, Tandon, Patel, et al. 2007).
Beyond functioning as simple tools for “surveillance” of population health, probability sample surveys that embed retrospective case-control designs (Laing, Schottenfeld, Lacey, Gillespie, Garabrant, et al. 2001) or serve as a baseline for a prospective cohort study (Langa, Plassman, Wallace, Herzog, Heeringa, et al. 2005) or even a randomized controlled trial (Elwood 1982) are employed in epidemiological studies of the incidence and etiology of health conditions or the effectiveness of programs designed to address adverse health outcomes.

Chen et al. describe the application of a GIS/GPS-assisted area probability sampling method in an epidemiological study designed to investigate the relationship between social capital and HIV risk behaviors among rural immigrants to the city of Wuhan, China. I am very pleased to see this paper appear in JSSAM for several reasons. First, the paper highlights the important role that advances in geographic information system (GIS) software, global positioning systems (GPS) and satellite, aerial and other geo-imaging tools, such as Google Earth and Street View, can now play in probability sample designs for households and other populations. In the decades before the advent of modern GIS software and GPS, satellite imagery and remote sensing technology were used to guide sample design and improve estimation for surveys of agricultural production and natural resources. A leading example of such early use of advanced technology is the Landsat program, which was launched in 1972 and is still actively gathering data (https://landsat.usgs.gov/; last accessed March 1, 2018). Historically, innovative spatial sampling and estimation methods based on geographic coordinate systems have been widely used in the fields of geology, natural resources, and wildlife population dynamics (Thompson 1992).
The US Census Bureau’s public release of the Topologically Integrated Geographic Encoding and Referencing (TIGER) digitized maps (data and shape files) at the time of the 1990 decennial census revolutionized the selection of area probability samples for US households. Precursors to today’s sophisticated software tools such as ArcGIS (https://www.arcgis.com; last accessed March 1, 2018) enabled sample designers to use these digitized map resources in combination with census and other geographic data to visualize spatial distributions of populations and to improve the efficiency of sample stratification and sample allocation in multistage area probability sample designs (Heeringa, Connor, Haeussler, Redmond, and Samonte 1994).

The sophistication and usability of commercially available GIS software advanced rapidly in the 1990s, and the Global Positioning System (GPS), which had been created by the United States Department of Defense (DOD) in the 1970s, became more widely used for navigation by the general public. However, civilian devices for GPS geolocation and navigation were subject to a perturbation of coordinates (roughly fifty meters) termed “selective availability.” Hence, we saw in the 1990s the “zig zag” pattern for linear street displays on maps produced from the original TIGER files. The US government turned off “selective availability” to GPS satellites in May of 2000, enabling further rapid advances in civilian applications for GPS and GIS-related technology. The advent of Google Earth in 2001 was one such advance. Google Earth integrated overlapping satellite images with the global GPS coordinate system, enabling users to locate and visualize areas of the Earth’s surface with a high degree of resolution, fine enough to locate and assign coordinates to individual structures as small as a typical dwelling unit. See, for example, the application of Google Earth by Wampler et al. (2013) in a sample design for a study of rural households in Haiti.
A second reason that I appreciate seeing this paper in publication is that it exposes survey statisticians and methodologists to emerging ideas for identifying sample subjects in research settings where resources needed to design and select probability samples of the target population are limited. Papers describing such new and innovative methods for sample selection are often published in specialized or disciplinary journals that are not commonly read by most of us who are more in the mainstream of survey statistics and methodology. This may be due to the particular scientific focus and “one-off” nature of many of the research applications for these designs. And it may possibly be due to some submission and editorial bias in our more familiar journals against papers that are largely descriptive of new design methods.

There is no formal definition of what constitutes such a “resource limited” or “resource poor” survey design. Here, a useful working definition includes design challenges where the ability to efficiently construct appropriate sample frames or operationalize the steps required to develop a probability sample of the target population is restricted. In the growing fields of research on outbreaks of epidemics or postdisaster needs assessment (Chang, Parrales, Jimenez, Sobieszczyk, Hammer, et al. 2009; Pietrzak, Tracy, Galea, Kilpatrick, Ruggiero, et al. 2012), a limiting factor is time: sample designs that are highly adaptable and quick to implement must be employed. Although less common than thirty years ago, there are still household populations for which accurate census data and maps needed to select conventional area probability samples do not exist or are not shared with civilian populations.
More commonly, as is the case in the Wuhan City study of HIV risk among recent rural-to-urban migrants conducted by Chen et al., existing census data, maps, registries, or administrative lists may be obsolete or uninformative for mobile, rare, or hard-to-survey (HTS; Tourangeau, Edwards, Johnson, Wolter, and Bates 2014) target populations.

Many of the new sampling plans that have been proposed for use in such “resource limited” settings use GIS, GPS, and Google Earth technology in various ways. In settings where residential structures are easily distinguished in high-resolution satellite views, Google Earth can be used to create a frame of eligible units. Field staff can then use exact GPS coordinates obtained from the Google Earth view of the study area to identify sample units on the ground (Wampler et al. 2013; Escamilla et al. 2014). Other GIS/GPS-assisted approaches such as that described by Chang et al. (2009) employ a variation of area probability sampling and use the satellite views in Google Earth to physically divide the study area into primary stage sampling units (PSUs) with recognizable boundaries. Dwelling units located within the boundaries of sampled PSUs are then enumerated by field teams and screened or subjected to an additional stage of sampling in much the same fashion as enumeration areas or area segments are listed and sampled in more conventional area probability sample designs. Singh and Clark (2013) describe a spatial sampling of residential structures and dwellings in Johannesburg that employs GIS and locally available geo-databases to stratify the area in which target populations reside and enumerate structures for contact and screening. The sampling plan and sample selection software described by Chen et al. make maximum use of GIS/GPS and Google Earth in selecting a multistage sample of dwellings from a defined target population area.
In doing so, they use technology to bypass several of the more time-consuming and resource-intensive activities required to develop the multistage sample frame and implement the sequence of sampling steps required to cluster, enumerate, sample, and screen a household population. To rigorously evaluate the strengths and weaknesses of this GIS/GPS assisted design method, it is useful to contrast the major sample selection steps and design features of their approach to that of a traditional area probability sample for a household target population. The aim in organizing the discussion in this way is to emphasize strengths and weaknesses of their GIS/GPS-assisted sampling plan and point out several places where the design might be modified to address identified weaknesses. Table 1 compares a traditional area sample design with the GIS/GPS-assisted method.

Table 1. Comparison of Conventional Multistage Area Probability Sampling with the GIS/GPS Assisted Approach of Chen et al.

| Stage of sample design | Standard area probability sampling approach | GIS/GPS-assisted area sampling approach (Chen et al.) |
| --- | --- | --- |
| Primary stage sample of PSUs | Using available maps, often accessed as digital maps with GIS technology, all land area in the study population is uniquely divided into $A$ PSUs with clearly identifiable boundaries (e.g., counties, enumeration areas). Using census or administrative data, each geographically defined PSU is assigned a measure of size in the form of a population or dwelling count. All PSUs are assigned to primary stage strata and one or more PSUs are selected from each stratum with probability proportionate to the PSU measure of size. | Using GPS/GIS resources, all land area in the study population is uniquely divided into $A$ PSUs by overlaying a symmetric grid (e.g., 100 m² cells). A probability sample of grid areas is selected and the boundaries of each of the PSUs are entered into field workers’ GPS units. As proposed, the sample of grid units can be a random sample with probability $f_1 = a/A$, or grids may be inspected using satellite or ground imagery, stratified (e.g., by residential density) into $H$ strata and chosen with probability $f_{1h} = a_h/A_h$. |
| Second stage sample of area segments | Using available maps or onsite field workers, the geographic area of each selected PSU is divided into non-overlapping smaller geographic areas termed “area segments,” each with a minimum size measured in households or population. The geographic boundaries of each area segment are fixed. One or more area segments is subsampled from each PSU with probability proportionate (PPS) to its measure of size. | Within each selected PSU from the grid, a single area segment of residential dwellings is identified. The boundaries of this subselected area are not predefined. They will be determined only when the random walk (see below) enumeration of dwellings is complete. The boundaries and area of the unit will be calculated from GPS measurements made by the field staff member. |
| Third stage sample of dwellings/households | All dwelling units within the selected area segment are enumerated by field staff. A random (usually systematic) sample of dwellings is selected from the enumerative list at a prespecified rate. | Beginning at an identifiable starting point within the PSU geounit, a random walk procedure is used to identify dwellings to contact and screen. The random walk continues until the desired cluster size of eligible households has been enumerated or the walk is terminated for other reasons. |
| Final stage sample of respondents | Sampled dwellings are contacted and rosters are completed of all eligible household members. Respondents are randomly selected from eligible individuals recorded on the household rosters. | Sampled dwellings are contacted and rosters are completed of all eligible household members. Respondents are randomly selected from eligible individuals recorded on the household rosters. |

The second column of table 1 outlines the general approach to the selection of area probability samples for surveying household populations in settings where resources are not extremely limited. With modest adaptation to local geopolitical and population census methods, similar multistage designs are used world-wide for in-person surveys (Heeringa and O’Muircheartaigh 2010). The GIS/GPS-assisted design described in Chen et al. is also a multistage area probability sample design in which the initial two stages of sampling involve selection of area units. The third stage is the selection of households, and the final stage is a selection of respondents from household rosters.
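The first-stage selection of grid cells summarized in table 1 can be illustrated with a short sketch. It is only a minimal illustration under stated assumptions, not the selection software of Chen et al.: the function name `select_grid_psus`, the 10 × 10 grid, the "dense"/"sparse" strata and the per-stratum sample sizes are all hypothetical. Within each stratum the cells are drawn by simple random sampling without replacement, so every selected cell carries the first-stage probability a_h/A_h; an unstratified design is the special case of a single stratum with probability a/A.

```python
import random
from collections import defaultdict


def select_grid_psus(cells, n_per_stratum, seed=None):
    """Draw an equal-probability sample of grid cells within each stratum.

    cells: list of (cell_id, stratum) pairs covering the whole study area.
    n_per_stratum: dict mapping stratum -> number of cells to draw (a_h).
    Returns a dict mapping each selected cell_id to its first-stage
    selection probability a_h / A_h.
    """
    rng = random.Random(seed)

    # Group the frame of grid cells by stratum.
    by_stratum = defaultdict(list)
    for cell_id, stratum in cells:
        by_stratum[stratum].append(cell_id)

    selected = {}
    for stratum, ids in by_stratum.items():
        a_h = n_per_stratum[stratum]   # cells to draw in this stratum
        A_h = len(ids)                 # cells available in this stratum
        if a_h > A_h:
            raise ValueError(f"stratum {stratum!r}: cannot draw {a_h} of {A_h} cells")
        # Simple random sampling without replacement within the stratum.
        for cell_id in rng.sample(ids, a_h):
            selected[cell_id] = a_h / A_h
    return selected


# Hypothetical frame: a 10 x 10 grid whose three lowest rows were judged
# "dense" from imagery and the rest "sparse" (illustrative only).
grid_cells = [((col, row), "dense" if row < 3 else "sparse")
              for row in range(10) for col in range(10)]
sample = select_grid_psus(grid_cells, {"dense": 6, "sparse": 7}, seed=1)

for cell_id, f1 in sorted(sample.items()):
    print(cell_id, round(f1, 3))
```

Returning the probability alongside each selected cell keeps the first factor of the multistage chain of selection probabilities attached to the PSU from the start, which is convenient when weights are assembled later.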
Given a fixed study budget and the aim of minimizing sampling error and sampling bias in the selected sample, an area probability sampling plan can be evaluated using several criteria:

- achieving high coverage of the study target population
- controlling the true inclusion probability for sample cases
- maintaining optimal average size of PSU-level (ultimate) clusters and minimizing variation in this cluster size across PSUs
- minimizing non-informative variation in the final inclusion probabilities and corresponding weights used in analysis

To review the advantages and disadvantages of the various sampling stages of their GIS/GPS-assisted sampling plan, it is convenient to use the authors’ example of a study of social capital and HIV risk in rural migrants to Wuhan, China. A primary aim of that study was to estimate the total number of resident rural migrants to Wuhan that are HIV positive, an epidemiological prevalence study that is well matched to a probability sample survey approach.

PRIMARY STAGE OF SAMPLING UNDER THE GIS/GPS-ASSISTED APPROACH

Area probability sampling draws its name from the fact that in the initial stages of sample frame development, the sampling units are defined as non-overlapping geographic units. With the assumption that all households and individuals can be linked 1:1 (or with known multiplicity) to each such geographic parcel, the probability sampling of areas is also a probability sampling of the population linked to those areas. At the primary stage of sample selection, the grid-overlay system employed in the authors’ GIS/GPS-assisted method has the advantage that it does in theory provide complete coverage of the land area in the Wuhan City survey population. The authors describe in some detail the key initial step in their GIS/GPS sampling plan of determining study sample size (n) and the optimal allocation of the sample to each of the multiple stages of selection.
The steps in sample size determination and multistage sample allocation do not differ from standard practice when the GIS/GPS-assisted sampling plan is used. Subject to total cost constraints, the total actual sample size is determined by the effective sample size needed to achieve precision targets for key survey estimates and adjustment for design effects that arise from the features (stratification, clustering, differential selection probabilities) of the complex multistage sample design (Heeringa and Ziniel 2012). The optimal size of the ultimate clusters (and hence the number of PSUs) is guided by cost factors and the intraclass correlation of the key variables in the target population. See Kish (1965) for formulas to calculate the optimal cluster size in multistage samples.

In settings where informative data on the distribution of the target population to the geo-unit areas defined by the coordinate grid cannot be obtained, the GIS/GPS-assisted sampling plan uses simple random sampling or stratified random sampling to select an equal probability sample of those PSUs for further stages of sampling. The primary-stage inclusion probabilities for GPS-defined geo-units (e.g., 100 m² areas) selected under each of these equal probability sampling methods are readily established. As the authors point out, due to heterogeneity in the distribution of the target population over the study area (see table 1 of their paper), equal probability sampling of the GPS-defined PSUs at this initial stage introduces substantial variation into the multistage chain of probabilities that will determine final selection probabilities and associated sampling weights for individual respondents. In most survey design settings, such random variability in inclusion probabilities for elements of the target population will result in increased variance of sample estimates of population statistics relative to a self-weighting sample of equivalent size.
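The size of this variance increase is often approximated with Kish's weighting factor, one plus the squared coefficient of variation of the weights, which applies when the weights are unrelated to the survey variable. The sketch below is a minimal illustration of that approximation; the function names and the four weights are hypothetical and are not taken from the Wuhan City data.

```python
def kish_weighting_factor(weights):
    """Approximate variance inflation 1 + cv^2(W) from a list of weights.

    Uses the n-denominator variance, so the result equals
    n * sum(w^2) / (sum(w))^2.
    """
    n = len(weights)
    mean_w = sum(weights) / n
    var_w = sum((w - mean_w) ** 2 for w in weights) / n
    return 1.0 + var_w / mean_w ** 2


def kish_factor_from_summary(mean_w, sd_w):
    """Same factor computed from a reported mean and standard deviation."""
    return 1.0 + (sd_w / mean_w) ** 2


# Hypothetical respondent weights (illustrative only).
weights = [10, 10, 20, 60]
factor = kish_weighting_factor(weights)
effective_n = len(weights) / factor   # nominal n deflated by the factor

print(round(factor, 2))       # 1.68
print(round(effective_n, 2))
```

Dividing the nominal sample size by the factor gives a rough effective sample size, which is a convenient way to compare a design with variable weights against a self-weighting design of the same nominal size.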
For example, in the Wuhan City study of rural-to-urban migrants, the mean weight value for respondents is 57.57 with a standard deviation of 126.13. Assuming the weight values are independent of the survey measure of interest (e.g., HIV risk), the variance of an estimated mean or proportion based on that measure is approximately $1 + cv^2(W)$, or $1 + (126.13/57.57)^2 \approx 5.8$, times that for a self-weighting sample of equal size (Kish 1965, Chapter 11). If GIS/GPS-assisted grid-sampling designs of this type are used in practice, it is important to take every possible step to minimize unnecessary variation in sample selection probabilities for eligible individuals in the target population.

In multistage area probability sample designs for general household surveys, control over final selection probabilities for sample households is most efficiently achieved by sampling area-based units (PSUs and area segments) with probability proportionate to size (PPS) and selecting dwellings within area segments at a rate that is inversely proportional to the probability of selection of area units in the preceding stages. See, for example, the design for a study of HIV risk among immigrant populations in Johannesburg, South Africa, that is described by Singh and Clark (2013). However, this technique of sampling PSUs and area segments with PPS and dwellings with probability inversely proportionate to size only works well if the assigned measures of size for the area units are reasonably accurate. For the Wuhan City study design, the ideal measures of size for PSUs and area segments would be the number of rural migrant households that reside in each area unit. These population data were not available for the Wuhan City study population. In such cases, an alternative suggested by the authors is to use Google Earth satellite imagery to visually examine the target population area in more detail.
The visual examination of high-resolution images of the study area may serve to assign approximate measures of size (e.g., structure counts and dwelling counts) to each geounit PSU or to stratify the grid-based PSUs according to the approximate size of the target population within the PSUs’ geographic boundaries (Cochran 1977). If stratification of PSUs by size is employed, the allocation of sample PSUs to each size stratum should then be set approximately proportionate to the expected share of the target population residing in each stratum. This stratification by size of the target population within the PSU will attenuate but not eliminate the unwanted variation in inclusion probabilities for the multistage sample of households and individuals (Cochran 1977). Following the selection of PSUs from the GIS grid overlay on the study area, the coordinates of the selected area units are loaded onto a GPS device that is used by the field staff member to locate the exact boundaries of a sampled PSU. This is where the two-dimensional world of most GIS mapping software and databases meets the irregular, three-dimensional reality of residential housing arrangements. Unlike conventional area probability samples where area sample units such as PSUs and area segments are defined according to distinct, recognizable geographic boundaries, the precise lines of the GPS-determined boundaries for selected PSUs, when projected onto the study area, will intersect streets and other boundaries at odd angles, and in urban areas, the boundaries are likely to bisect residential buildings (see Figure A.2 in the appendix to the Chen et al. paper). Field staff will need to be supplied with objective rules and carefully trained to make 1:1 assignments of dwellings to PSUs at this stage in order to avoid potential problems of selective coverage of dwellings on the PSU boundaries. 
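One way to make such an assignment rule objective is to tie each dwelling to a single reference point and to treat grid cells as half-open intervals, so that a point lying exactly on a shared boundary belongs to exactly one cell. The sketch below is a hypothetical rule offered for illustration, not a rule described by Chen et al.; it assumes coordinates already projected to meters, a reference point such as the GPS position of the main entrance, and an illustrative cell side of 100 m.

```python
import math


def assign_cell(x_m, y_m, origin_x_m, origin_y_m, cell_size_m):
    """Return the (column, row) index of the one grid cell owning a point.

    Cells are half-open: a point on a cell's left or bottom edge belongs
    to that cell, while a point on its right or top edge belongs to the
    neighboring cell. Every point therefore maps to exactly one cell.
    """
    col = math.floor((x_m - origin_x_m) / cell_size_m)
    row = math.floor((y_m - origin_y_m) / cell_size_m)
    return col, row


# Hypothetical dwelling reference points (projected meters).
entrances = {
    "dwelling_a": (99.99, 50.0),   # just inside the first column
    "dwelling_b": (100.0, 50.0),   # exactly on the shared boundary
    "dwelling_c": (250.0, 100.0),  # on a horizontal boundary
}

for name, (x, y) in entrances.items():
    print(name, assign_cell(x, y, 0.0, 0.0, 100.0))
```

A building bisected by a grid line is then in or out of the sampled PSU according to one recorded point rather than the interviewer's judgment, although the choice of reference point itself would still need to be fixed in the field protocol.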
This is clearly an area of needed methodological research and development as GIS/GPS-assisted area sampling strategies of the proposed type are employed in surveys of household populations.

SECOND, THIRD AND FINAL STAGES OF SAMPLING IN THE GIS/GPS-ASSISTED APPROACH

After the field staff have identified the boundaries of the PSU using their GPS device, they initiate the second stage of sample selection. An innovative but complicating feature of the authors’ GIS/GPS-assisted sampling plan for household populations lies in its use of a random walk method to sample and screen a cluster of eligible households and respondents (see the third stage in table 1). This random walk method of household enumeration within the sample PSU also defines the size and location of the second stage area segment from which households are sampled (see the second stage in table 1). As described in the paper, the random walk begins at “the main entrance of a street in urban areas or the beginning of a village in rural areas.” The field interviewer continues the random walk, screening and identifying households with eligible respondents until a fixed target of eligible households has been identified or an identifiable termination point (e.g., a new street intersection) is reached. At that point, the interviewer terminates their random walk enumeration.

As the authors note, this methodology has the advantage of creating balance in the interviewer workloads across PSUs: the random walk route is terminated when the desired number of eligible individuals has been identified. By allowing the physical size of the second stage area segment to expand until the target number of eligible households is identified, it also offsets some of the inequality in selection probabilities for the target population (and respondent weights) that is due to selection of PSUs without controlling for the associated size of the eligible population.
Note, however, by using the random walk and allowing the size of the enumerated area segment within each PSU to expand until a target number of eligible households has been identified, the probability of selection, f2, for the second stage area segment cannot be directly calculated—it must be estimated. The authors propose two methods for estimating f2, both of which require the field interviewer to use their GPS instrument to accurately outline the boundaries of the residential area that they covered in their random walk. This geo-positioning data will be used later by the central office to estimate the size of the actual residential area covered by the random walk route and the probability of selection for the area segment covered by the random walk within the PSU. Mathematically, the two ratio methods for estimating the area segment selection probability are reasonable. The critical variables in this process are the rules by which the local interviewer performs the local measurements of the area segment boundaries, which in turn lead to an estimated residential area, Ag, for the enumerated area segment. In my view, a major weakness of the proposed GIS/GPS-assisted sampling plan is the use of the random walk methodology to enumerate and select sample dwellings within sampled geo-units. Although the random walk is widely used in resource-limited sampling of household populations for epidemiological and other studies, published research by Bauer (2016) and earlier research summarized in Heeringa and Ziniel (2012) suggests that despite their attempt at “randomness,” random walk protocols do not achieve the objective of sampling households with known—let alone equal—probability. Determination of individual selection probabilities becomes even more ambiguous when the random walk is allowed to meander outside the fixed boundaries of the sample PSU (see Figure A.2 in the appendix to the Chen et al. paper). 
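To illustrate how an area-based estimate of f2 might be computed from the interviewer's GPS outline, the sketch below applies the shoelace formula to the outline vertices to obtain Ag and then forms a simple ratio. This is an assumption-laden illustration: the ratio Ag divided by an estimated residential area of the PSU is one plausible form, not necessarily either of the authors' two methods, and the names `polygon_area`, `estimate_f2`, and `psu_residential_area` are hypothetical. Coordinates are assumed to be already projected to meters.

```python
def polygon_area(vertices):
    """Area of a simple polygon via the shoelace formula.

    vertices: list of (x, y) points in projected meters, in boundary order.
    """
    n = len(vertices)
    twice_area = 0.0
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]  # wrap around to close the polygon
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0

def estimate_f2(segment_outline, psu_residential_area):
    """Hypothetical ratio estimate: covered area over PSU residential area."""
    a_g = polygon_area(segment_outline)
    return a_g / psu_residential_area

# Hypothetical 40 m x 25 m outline recorded along a random walk route.
outline = [(0.0, 0.0), (40.0, 0.0), (40.0, 25.0), (0.0, 25.0)]
a_g = polygon_area(outline)
f2 = estimate_f2(outline, 8000.0)
```

The example makes the dependence explicit: any inconsistency in how interviewers trace the outline feeds directly into Ag and therefore into the estimated selection probability.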
Although this GIS/GPS-assisted sampling plan is designed to be employed in resource-limited settings, there is an alternative to the random walk that requires only modest additional work by the field staff and which should eliminate much of the serious noncoverage and uncertainty over inclusion probabilities that can arise with the random walk method of enumeration. The alternative is the technique of “segmenting” a larger area sample unit in the field and randomly selecting segmented subparts of the area for the intensive tasks of dwelling enumeration, household sampling, and screening (see Kish 1965, Section 9.7D). This is a technique that is still widely used in area probability sampling in large rural areas and in urban areas where defined area sample units contain very large numbers of dwellings. Subparts of the original geounit with identifiable boundaries can be defined on location by the field staff and randomly ordered for household enumeration and screening. To achieve control over the cluster size of eligible households within each primary stage geo-unit, randomly ordered subparts can be introduced sequentially to the enumeration until the target cluster size for the geounit PSU is met or exceeded. Once a subpart is introduced to the second stage sample for the PSU, all dwellings in that subpart should be enumerated. If the enumeration produces a larger number of eligible households than desired, individual households can be subsampled prior to final household and respondent selection. In the Wuhan City study, the authors recommend the use of standard methods for completing a roster of eligible household members and randomly selecting one or more individuals and survey respondents. This methodology is consistent with the standard practice for conventional area probability surveys. 
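The segmenting procedure described above can be sketched in a few lines. The code is a simulation under stated assumptions: in the field, the number of eligible households in a subpart is known only after that subpart has been fully enumerated, whereas here the counts are supplied up front for illustration. The function names and the example counts are hypothetical.

```python
import random

def select_subparts(eligible_counts, target, rng):
    """Introduce randomly ordered subparts until the target is met or exceeded.

    eligible_counts: dict mapping subpart id -> eligible households found when
    that subpart is fully enumerated (simulated here).
    """
    order = list(eligible_counts)
    rng.shuffle(order)                      # random ordering of subparts
    introduced, total = [], 0
    for subpart in order:
        if total >= target:
            break
        introduced.append(subpart)          # whole subpart is enumerated
        total += eligible_counts[subpart]
    return introduced, total

def subsample_households(households, target, rng):
    """Subsample at random if enumeration yields more households than desired."""
    if len(households) <= target:
        return list(households)
    return rng.sample(households, target)

rng = random.Random(2018)
counts = {"A": 4, "B": 7, "C": 3, "D": 6}   # hypothetical subpart yields
introduced, total = select_subparts(counts, target=10, rng=rng)
households = [f"hh{i}" for i in range(total)]
sample = subsample_households(households, 10, rng)
```

Because the subparts have fixed, identifiable boundaries and the subsampling rate is recorded, the selection probability of each household can be computed rather than estimated from a walked route.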
SURVEY WEIGHTING AND ESTIMATION UNDER THE GIS/GPS-ASSISTED APPROACH

The authors propose the following expression for calculating individual sample selection weights for respondents selected using the multistage GIS/GPS-assisted design. The individual weights calculated using this formula may in turn be used to compute design-based, weighted estimates of descriptive statistics for the target population.

$$\hat{W}_i = W_g \cdot W_{gh} \cdot W_{ghi} = \frac{\hat{R}}{a \cdot A_g} \cdot \frac{T_g}{H_g} \cdot \frac{N_{gh}}{n_{gh}}$$

where:
$W_g$ = the combined weight factor for sampling of PSU $\alpha$ and area segment $g$;
$W_{gh}$ = the weight factor for sampling household $h$ from area segment $g$;
$W_{ghi}$ = the weight factor for sampling respondent $i$ in household $h$;
$\hat{R}$ = the Method 1 or 2 estimate of total residential area in the survey population;
$a$ = total number of primary stage selections, $\alpha = 1, \ldots, a$;
$A_g$ = field staff measure of random walk area in area segment $g$ and PSU $\alpha$;
$T_g$ = total number of enumerated households in area segment $g$;
$H_g$ = total number of sampled households in area segment $g$;
$N_{gh}$ = the total number of eligible individuals in household $h$;
$n_{gh}$ = the total number of randomly sampled individuals in household $h$.

A “^” is placed over $W_i$ since the expression for this weight includes a quantity, $\hat{R}$, that must be estimated from field observations that determine the value of $A_g$. As noted above, the total error in the assigned weights will depend in large part on the consistency and accuracy of field staffs’ GPS coordinate measures that are used to determine the individual values of $A_g$. Assuming probability sampling and accurate determination of the sample inclusion probabilities and corresponding weights, the multistage design for the authors’ GIS/GPS-assisted sampling method is fully compatible with the available alternatives for robust estimation of sampling variances for complex samples (Heeringa, West, and Berglund 2017). 
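As a small numerical illustration of the weight expression, the sketch below multiplies the three factors, reading the formula as the estimated residential area divided by (a times Ag), times Tg over Hg, times Ngh over ngh. That reading, the function name `respondent_weight`, and all input values are assumptions made for illustration; they are not figures from the Wuhan study.

```python
def respondent_weight(r_hat, a, a_g, t_g, h_g, n_gh_total, n_gh_sampled):
    """Product of the area, household, and respondent weight factors."""
    w_g = r_hat / (a * a_g)            # area factor: inverse of a * A_g / R-hat
    w_gh = t_g / h_g                   # enumerated over sampled households
    w_ghi = n_gh_total / n_gh_sampled  # eligible over sampled individuals
    return w_g * w_gh * w_ghi

# Hypothetical inputs: areas in square meters, counts as integers.
w = respondent_weight(
    r_hat=2_000_000.0,  # estimated total residential area
    a=40,               # number of primary stage selections
    a_g=1000.0,         # GPS-measured random walk area
    t_g=30,             # households enumerated in the segment
    h_g=10,             # households sampled in the segment
    n_gh_total=3,       # eligible individuals in the household
    n_gh_sampled=1,     # individuals sampled in the household
)
```

Since Ag sits in the denominator of the first factor, a 10 percent understatement of the walked area inflates the respondent's weight by roughly 11 percent, which is why measurement consistency matters so much here.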
SUMMARY

In concluding, I wish to again congratulate the authors on their innovative application of GIS/GPS to area probability sampling of household target populations when time constraints or physical resources and conditions limit the ability to employ conventional area probability sampling methods. My hope is that the critique that I have presented in my discussion will stimulate additional methodological research on their GIS/GPS-assisted area probability sample designs, focused on the real constraints that exist in resource-limited settings but with an eye to opportunities that the GIS/GPS and Google Earth technology provide to construct designs that are optimal under those constraints.

References

Bauer J. (2016), “Biases in Random Route Surveys,” Journal of Survey Statistics and Methodology, 4, 263–287.
Chang A. Y., Parrales M. E., Jimenez J., Sobieszczyk M. E., Hammer S. M., Copenhaver D. J., Kulkarni R. P. (2009), “Combining Google Earth and GIS Mapping Technologies in a Dengue Surveillance System for Developing Countries,” International Journal of Health Geographics, 8, 49.
Cochran W. G. (1977), Sampling Techniques (3rd ed.), New York: John Wiley and Sons.
Elwood P. C. (1982), “Randomized Controlled Trials: Sampling,” British Journal of Clinical Pharmacy, 13, 631–636.
Escamilla V., Emch M., Dandalo L., Miller W. C., Martinson F., Hoffman I. (2014), “Sampling at community level using satellite imagery and geographical analysis,” Bulletin of the World Health Organization, 92, 690–694.
Heeringa S. G., Connor J. H., Haeussler J. S., Redmond G. B., Samonte J. E. (1994), “1990 SRC National Sample: Design and Development,” Survey Research Center Technical Report, Institute for Social Research, University of Michigan, Ann Arbor, MI.
Heeringa S., O’Muircheartaigh C. (2010), “Sampling Designs for Cross-Cultural and Cross-National Survey Programs,” in Survey Methods in Multinational, Multiregional and Multicultural Contexts, ed. Harkness et al., pp. 251–268, New York: John Wiley and Sons.
Heeringa S., Ziniel S. (2012), Sample Design and Procedures for Hepatitis B Immunization Surveys: A Companion to the WHO Cluster Survey Manual. WHO/ivb/11.12.
Heeringa S. G., West B. T., Berglund P. A. (2017), Applied Survey Data Analysis, Second Edition, London: Chapman and Hall.
Keiding N., Louis T. A. (2016), “Perils and Potentials of Self-Selected Entry to Epidemiological Studies and Surveys,” Journal of the Royal Statistical Society, Series A, 179, 319–376.
Kish L. (1965), Survey Sampling, New York: John Wiley and Sons.
Kessler R. C., Haro J. M., Heeringa S. G., Pennell B.-E., Bedirhan Üstün T. (2006), “The World Health Organization World Mental Health Survey Initiative,” Epidemiologia e Psichiatria Sociale, 15, 161–166.
Laing T. J., Schottenfeld D., Lacey J. V. Jr, Gillespie B. W., Garabrant D. H., Cooper B. C., Heeringa S. G., Alcser K. H., Mayes M. D. (2001), “Potential Risk Factors for Undifferentiated Connective Tissue Disease Among Women: Implanted Medical Devices,” American Journal of Epidemiology, 154, 610–617.
Langa K. M., Plassman B. L., Wallace R. B., Herzog A. R., Heeringa S. G., Ofstedal M. B., Burke J. R., Fisher G. G., Fultz N. H., Hurd M. D., Potter G. G., Rodgers W. L., Steffens D. C., Weir D. R., Willis R. J. (2005), “The Aging, Demographics and Memory Study: Study Design and Methods,” Neuroepidemiology, 25, 181–191.
Moussavi S., Chatterji S., Verdes E., Tandon A., Patel V., Ustun B. (2007), “Depression, Chronic Disease and Decrements in Health: Results from the World Health Surveys,” The Lancet, 370, 851–858.
Pietrzak R. H., Tracy M., Galea S., Kilpatrick D. G., Ruggiero K. J., Hamblen J. L., Southwick S. M., Norris F. H. (2012), “Resilience in the Face of Disaster: Prevalence and Longitudinal Course of Mental Disorders Following Hurricane Ike,” PLoS One, 7, e38964.
Singh G., Clark B. D. (2013), “Creating a Frame: A Spatial Approach to Random Sampling of Immigrant Households in Inner City Johannesburg,” Journal of Refugee Studies, 26, 126–144.
Thompson S. K. (1992), Sampling, New York: John Wiley and Sons.
Tourangeau R., Edwards B., Johnson T. J., Wolter K. M., Bates N., eds. (2014), Hard-to-Survey Populations, Cambridge, UK: Cambridge University Press.
Wampler P. J., Rediske R. R., Molla A. R. (2013), “Using Arcmap, Google Earth, and Global Position Systems to Select and Locate Random Households in Rural Haiti,” International Journal of Health Geographics, 12, 3, online, https://doi.org/10.1186/1476-072X-12-3, last accessed March 1, 2018.

© The Author(s) 2018. Published by Oxford University Press on behalf of the American Association for Public Opinion Research. All rights reserved. For permissions, please email: journals.permissions@oup.com This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices)

Journal of Survey Statistics and Methodology, Oxford University Press

Publisher
Oxford University Press
ISSN
2325-0984
eISSN
2325-0992
D.O.I.
10.1093/jssam/smy004
Abstract

Many innovative, new sampling plans that have been proposed for use in “resource limited” settings use GIS, GPS and Google Earth technology. In this issue of JSSAM, Chen et al. describe a GIS/GPS-assisted area probability sampling method and its application to an epidemiological study designed to investigate the relationship between social capital and HIV risk behaviors among rural immigrants to the city of Wuhan, China. The following discussion examines the strengths and weaknesses of the Chen et al. sampling plan and related applications of GIS/GPS/Google Earth technology to area probability sampling of household target populations. The aim is to stimulate additional methodological research on GIS/GPS-assisted area probability sample designs, focused on the real constraints that exist in resource limited settings but with an eye to opportunities that the GIS/GPS and Google Earth technology provide to construct designs that are optimal under those constraints.

INTRODUCTION

Despite debate over the need for external validity in some forms of epidemiological research (Keiding and Louis 2016), there is no question that probability sample surveys are important tools in studies of disease prevalence and general health-related characteristics of target populations. In the United States, the National Health Interview Survey (NHIS) and the National Health and Nutrition Examination Survey (NHANES) provide critical longitudinal data series on the health of US adults and children. Similar national health survey programs in countries around the globe are fielded independently by national health ministries or as part of multinational programs of coordinated health surveys (Kessler, Haro, Heeringa, Pennell, and Bedirhan 2006; Moussavi, Chatterji, Verdes, Tandon, Patel, et al. 2007). 
Beyond functioning as simple tools for “surveillance” of population health, probability sample surveys that embed retrospective case-control designs (Laing, Schottenfeld, Lacey, Gillespie, Garabrant, et al. 2001) or serve as a baseline for a prospective cohort study (Langa, Plassman, Wallace, Herzog, Heeringa, et al. 2005) or even a randomized control trial (Elwood 1982) are employed in epidemiological studies of the incidence and etiology of health conditions or the effectiveness of programs designed to address adverse health outcomes.

Chen et al. describe the application of a GIS/GPS-assisted area probability sampling method in an epidemiological study designed to investigate the relationship between social capital and HIV risk behaviors among rural immigrants to the city of Wuhan, China. I am very pleased to see this paper appear in JSSAM for several reasons. First, the paper highlights the importance that advances in geographic information system (GIS) software, global positioning systems (GPS) and satellite, aerial and other geo-imaging tools, such as Google Earth and Street View, can now play in probability sample designs for households and other populations. In the decades before the advent of modern GIS software and GPS, satellite imagery and remote sensing technology were used to guide sample design and improve estimation for surveys of agricultural production and natural resources. A leading example of such early use of advanced technology is the Landsat program, which was launched in 1972 and is still actively gathering data (https://landsat.usgs.gov/; last accessed March 1, 2018). Historically, innovative spatial sampling and estimation methods based on geographic coordinate systems have been widely used in the fields of geology, natural resources, and wildlife population dynamics (Thompson 1992). 
The US Bureau of Census’s public release of the Topologically Integrated Geographic Encoding and Referencing (TIGER) digitized maps (data and shape files) at the time of the 1990 decennial census revolutionized the selection of area probability samples for US households. Precursors to today’s sophisticated software tools such as ArcGIS (https://www.arcgis.com; last accessed March 1, 2018) enabled sample designers to use these digitized map resources in combination with census and other geographic data to visualize spatial distributions of populations and to improve the efficiency of sample stratification and sample allocation in multistage area probability sample designs (Heeringa, Connor, Haeussler, Redmond, and Samonte 1994). The sophistication and usability of commercially available GIS software advanced rapidly in the 1990s, and the Global Positioning System (GPS), which had been created by the United States Department of Defense (DOD) in the 1970s, became more widely used for navigation by the general public. However, civilian devices for GPS geolocation and navigation were subject to a perturbation of coordinates (roughly fifty meters) termed “selective availability.” Hence, we saw in the 1990s the “zig zag” pattern for linear street displays on maps produced from the original TIGER files. The US government turned off “selective availability” to GPS satellites in May of 2000, enabling further rapid advances in civilian applications for GPS and GIS-related technology. The advent of Google Earth in 2001 was one such advance. Google Earth integrated overlapping satellite images with the global GPS coordinate system, enabling users to locate and visualize areas of the Earth’s surface with a high degree of resolution, fine enough to locate and assign coordinates to individual structures as small as a typical dwelling unit. See, for example, the application of Google Earth by Wampler et al. (2013) in a sample design for a study of rural households in Haiti. 
A second reason that I appreciate seeing this paper in publication is that it exposes survey statisticians and methodologists to emerging ideas for identifying sample subjects in research settings where resources needed to design and select probability samples of the target population are limited. Papers describing such new and innovative methods for sample selection are often published in specialized or disciplinary journals that are not commonly read by most of us who are more in the mainstream of survey statistics and methodology. This may be due to the particular scientific focus and “one-off” nature of many of the research applications for these designs. It may also be due to some submission and editorial bias in our more familiar journals against papers that are largely descriptive of new design methods.

There is no formal definition of what constitutes such a “resource limited” or “resource poor” survey design. Here, a useful working definition includes design challenges where the abilities to efficiently construct appropriate sample frames or operationalize the steps required to develop a probability sample of the target population are restricted. In the growing fields of research on outbreaks of epidemics or postdisaster needs assessment (Chang, Parrales, Jimenez, Sobieszczyk, Hammer, et al. 2009; Pietrzak, Tracy, Galea, Kilpatrick, Ruggiero, et al. 2012), a limiting factor is time: sample designs that are highly adaptable and quick to implement must be employed. Although less common than thirty years ago, there are still household populations for which accurate census data and maps needed to select conventional area probability samples do not exist or are not shared with civilian populations. 
More commonly, as is the case in the Wuhan City study of HIV risk among recent rural to urban migrants conducted by Chen et al., existing census data, maps, registries, or administrative lists may be obsolete or uninformative for mobile, rare, or hard-to-survey (HTS, Tourangeau, Edwards, Johnson, Wolter, and Bates 2014) target populations. Many of the new sampling plans that have been proposed for use in such “resource limited” settings use GIS, GPS, and Google Earth technology in various ways. In settings where residential structures are easily distinguished in high-resolution satellite views, Google Earth can be used to create a frame of eligible units. Field staff can then use exact GPS coordinates obtained from the Google Earth view of the study area to identify sample units on the ground (Wampler et al. 2013; Escamilla et al. 2014). Other GIS/GPS-assisted approaches such as that described by Chang et al. (2009) employ a variation of area probability sampling and use the satellite views in Google Earth to physically divide the study area into primary stage sampling units (PSUs) with recognizable boundaries. Dwelling units located within the boundaries of sampled PSUs are then enumerated by field teams and screened or subjected to an additional stage of sampling in much the same fashion as enumeration areas or area segments are listed and sampled in more conventional area probability sample designs. Singh and Clark (2013) describe a spatial sampling of residential structures and dwellings in Johannesburg that employs GIS and locally available geo-data bases to stratify the area in which target populations reside and enumerate structures for contact and screening. The sampling plan and sample selection software described by Chen et al. make maximum use of GIS/GPS and Google Earth in selecting a multistage sample of dwellings from a defined target population area. 
In doing so, they use technology to bypass several of the more time-consuming and resource-intensive activities required to develop the multistage sample frame and implement the sequence of sampling steps required to cluster, enumerate, sample, and screen a household population. To rigorously evaluate the strengths and weaknesses of this GIS/GPS-assisted design method, it is useful to contrast the major sample selection steps and design features of their approach to that of a traditional area probability sample for a household target population. The aim in organizing the discussion in this way is to emphasize strengths and weaknesses of their GIS/GPS-assisted sampling plan and point out several places where the design might be modified to address identified weaknesses. Table 1 compares a traditional area sample design with the GIS/GPS-assisted method.

Table 1. Comparison of Conventional Multistage Area Probability Sampling with the GIS/GPS-Assisted Approach of Chen et al.

| Stage of sample design | Standard area probability sampling approach | GIS/GPS-assisted area sampling approach (Chen et al.) |
| --- | --- | --- |
| Primary stage sample of PSUs | Using available maps, often accessed as digital maps with GIS technology, all land area in the study population is uniquely divided into A PSUs with clearly identifiable boundaries (e.g., counties, enumeration areas). Using census or administrative data, each geographically defined PSU is assigned a measure of size in the form of a population or dwelling count. All PSUs are assigned to primary stage strata and one or more PSUs are selected from each stratum with probability proportionate to the PSU measure of size. | Using GPS/GIS resources, all land area in the study population is uniquely divided into A PSUs by overlaying a symmetric grid (e.g., 100 m² cells). A probability sample of grid areas is selected and the boundaries of each of the PSUs are entered into field workers’ GPS units. As proposed, the sample of grid units can be a random sample with probability f1 = a/A, or grids may be inspected using satellite or ground imagery, stratified (e.g., by residential density) into H strata and chosen with probability f1h = a_h/A_h. |
| Second stage sample of area segments | Using available maps or onsite field workers, the geographic area of each selected PSU is divided into non-overlapping smaller geographic areas termed “area segments,” each with a minimum size measured in households or population. The geographic boundaries of each area segment are fixed. One or more area segments is subsampled from each PSU with probability proportionate (PPS) to its measure of size. | Within each selected PSU from the grid, a single area segment of residential dwellings is identified. The boundaries of this subselected area are not predefined. They will be determined only when the random walk (see below) enumeration of dwellings is complete. The boundaries and area of the unit will be calculated from GPS measurements made by the field staff member. |
| Third stage sample of dwellings/households | All dwelling units within the selected area segment are enumerated by field staff. A random (usually systematic) sample of dwellings is selected from the enumerative list at a prespecified rate. | Beginning at an identifiable starting point within the PSU geounit, a random walk procedure is used to identify dwellings to contact and screen. The random walk continues until the desired cluster size of eligible households has been enumerated or the walk is terminated for other reasons. |
| Final stage sample of respondents | Sampled dwellings are contacted and rosters are completed of all eligible household members. Respondents are randomly selected from eligible individuals recorded on the household rosters. | Sampled dwellings are contacted and rosters are completed of all eligible household members. Respondents are randomly selected from eligible individuals recorded on the household rosters. |

The second column of table 1 outlines the general approach to the selection of area probability samples for surveying household populations in settings where resources are not extremely limited. With modest adaptation to local geopolitical and population census methods, similar multistage designs are used world-wide for in-person surveys (Heeringa and O’Muircheartaigh 2010). The GIS/GPS-assisted design described in Chen et al. is also a multistage area probability sample design in which the initial two stages of sampling involve selection of area units. The third stage is the selection of households, and the final stage is a selection of respondents from household rosters. 
Assuming a fixed study budget and the goal of minimizing sampling error and sampling bias in the selected sample, an area probability sampling plan can be evaluated using several criteria:

- achieving high coverage of the study target population
- controlling the true inclusion probability for sample cases
- maintaining optimal average size of PSU-level (ultimate) clusters and minimizing variation in this cluster size across PSUs
- minimizing non-informative variation in the final inclusion probabilities and corresponding weights used in analysis

To review the advantages and disadvantages of the various sampling stages of their GIS/GPS-assisted sampling plan, it is convenient to use the authors’ example of a study of social capital and HIV risk in rural migrants to Wuhan, China. A primary aim of that study was to estimate the total number of resident rural migrants to Wuhan who are HIV positive, an epidemiological prevalence study that is well matched to a probability sample survey approach.

PRIMARY STAGE OF SAMPLING UNDER THE GIS/GPS-ASSISTED APPROACH

Area probability sampling draws its name from the fact that in the initial stages of sample frame development, the sampling units are defined as non-overlapping geographic units. With the assumption that all households and individuals can be linked 1:1 (or with known multiplicity) to each such geographic parcel, the probability sampling of areas is also a probability sampling of the population linked to those areas. At the primary stage of sample selection, the grid-overlay system employed in the authors’ GIS/GPS-assisted method has the advantage that it does in theory provide complete coverage of the land area in the Wuhan City survey population. The authors describe in some detail the key initial step in their GIS/GPS sampling plan of determining study sample size (n) and the optimal allocation of the sample to each of the multiple stages of selection.
The steps in sample size determination and multistage sample allocation do not differ from standard practice when the GIS/GPS-assisted sampling plan is used. Subject to total cost constraints, the total actual sample size is determined by the effective sample size needed to achieve precision targets for key survey estimates and adjustment for design effects that arise from the features (stratification, clustering, differential selection probabilities) of the complex multistage sample design (Heeringa and Ziniel 2012). The optimal size of the ultimate clusters (and hence the number of PSUs) is guided by cost factors and the intraclass correlation of the key variables in the target population. See Kish (1965) for formulas to calculate the optimal cluster size in multistage samples.

In settings where informative data on the distribution of the target population to the geo-unit areas defined by the coordinate grid cannot be obtained, the GIS/GPS-assisted sampling plan uses simple random sampling or stratified random sampling to select an equal probability sample of those PSUs for further stages of sampling. The primary-stage inclusion probabilities for GPS-defined geo-units (e.g., 100 m² areas) selected under each of these equal probability sampling methods are readily established. As the authors point out, due to heterogeneity in the distribution of the target population over the study area (see table 1 of their paper), equal probability sampling of the GPS-defined PSUs at this initial stage introduces substantial variation into the multistage chain of probabilities that will determine final selection probabilities and associated sampling weights for individual respondents. In most survey design settings, such random variability in inclusion probabilities for elements of the target population will result in increased variance of sample estimates of population statistics relative to a self-weighting sample of equivalent size.
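As a rough illustration of the sample size logic described in this section, the Python sketch below applies the familiar optimal cluster size approximation, b = sqrt((C_a / c) (1 - rho) / rho), and the clustering design effect, 1 + (b - 1) rho, that are commonly associated with Kish (1965). The cost figures, the intraclass correlation, and all function names are hypothetical and are not taken from the Wuhan City study.

```python
import math


def optimal_cluster_size(cost_per_psu, cost_per_element, rho):
    """Approximate optimal number of elements per PSU.

    cost_per_psu     -- cost of adding one PSU (hypothetical units)
    cost_per_element -- cost of adding one interview within a PSU
    rho              -- intraclass correlation of the key variable
    """
    if not 0.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between 0 and 1")
    return math.sqrt((cost_per_psu / cost_per_element) * (1.0 - rho) / rho)


def clustering_design_effect(cluster_size, rho):
    """Design effect from clustering: 1 + (b - 1) * rho."""
    return 1.0 + (cluster_size - 1) * rho


def required_sample_size(n_effective, deff_cluster, deff_weight=1.0):
    """Inflate the effective sample size by the assumed design effects."""
    return math.ceil(n_effective * deff_cluster * deff_weight)


# Hypothetical inputs, for illustration only.
b_opt = optimal_cluster_size(cost_per_psu=400.0, cost_per_element=25.0, rho=0.05)
deff = clustering_design_effect(round(b_opt), rho=0.05)
n_total = required_sample_size(n_effective=400, deff_cluster=deff)
```

Under these assumptions, the number of PSUs would then follow from dividing `n_total` by the chosen cluster size.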
For example, in the Wuhan City study of rural-to-urban migrants, the mean weight value for respondents is 57.57 with a standard deviation of 126.13. Assuming the weight values are independent of the survey measure of interest (e.g., HIV risk), the variance of an estimated mean or proportion based on that measure is approximately $1 + cv^2(W)$, or $1 + (126.13/57.57)^2 = 5.8$, times that for a self-weighting sample of equal size (Kish 1965, Chapter 11). If GIS/GPS-assisted grid-sampling designs of this type are used in practice, it is important to take every possible step to minimize unnecessary variation in sample selection probabilities for eligible individuals in the target population.

In multistage area probability sample designs for general household surveys, control over final selection probabilities for sample households is most efficiently achieved by sampling area-based units (PSUs and area segments) with probability proportionate to size (PPS) and selecting dwellings within area segments at a rate that is inversely proportional to the probability of selection of area units in the preceding stages. See, for example, the design for a study of HIV risk among immigrant populations in Johannesburg, South Africa, that is described by Singh and Clark (2013). However, this technique of sampling PSUs and area segments with PPS and dwellings with probability inversely proportionate to size only works well if the assigned measures of size for the area units are reasonably accurate. For the Wuhan City study design, the ideal measures of size for PSUs and area segments would be the number of rural migrant households that reside in each area unit. These population data were not available for the Wuhan City study population. In such cases, an alternative suggested by the authors is to use Google Earth satellite imagery to visually examine the target population area in more detail.
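The two calculations discussed above can be sketched in a few lines of Python. The first function reproduces the weighting-loss factor from the reported mean and standard deviation of the Wuhan weights; the second shows, for a hypothetical set of PSU sizes, how PPS selection followed by an inversely proportional within-PSU rate yields equal overall probabilities. The function names, PSU sizes, and sample counts are illustrative assumptions, and the sketch presumes that the measures of size are accurate.

```python
def kish_weighting_loss(mean_weight, sd_weight):
    """Approximate variance inflation from unequal weights: 1 + cv^2(W)."""
    cv = sd_weight / mean_weight
    return 1.0 + cv ** 2


def pps_overall_probabilities(psu_sizes, n_psus, cluster_size):
    """Overall selection probability per PSU under a two-stage PPS design.

    Stage 1: PSU selected with probability n_psus * size / total size.
    Stage 2: dwellings selected at rate cluster_size / size.
    """
    total_size = sum(psu_sizes)
    overall = []
    for size in psu_sizes:
        p_psu = n_psus * size / total_size
        p_within = cluster_size / size
        if p_psu > 1.0 or p_within > 1.0:
            raise ValueError("a probability exceeds 1; adjust sizes or counts")
        overall.append(p_psu * p_within)
    return overall


# Mean and standard deviation of the weights as reported in the text.
wuhan_loss = kish_weighting_loss(57.57, 126.13)

# Hypothetical measures of size (e.g., household counts) for four PSUs.
probs = pps_overall_probabilities([120, 300, 80, 500], n_psus=2, cluster_size=20)
```

Every entry of `probs` equals `n_psus * cluster_size / total size`, which is what makes the design self-weighting when the size measures are correct.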
The visual examination of high-resolution images of the study area may serve to assign approximate measures of size (e.g., structure counts and dwelling counts) to each geounit PSU or to stratify the grid-based PSUs according to the approximate size of the target population within the PSUs’ geographic boundaries (Cochran 1977). If stratification of PSUs by size is employed, the allocation of sample PSUs to each size stratum should then be set approximately proportionate to the expected share of the target population residing in each stratum. This stratification by size of the target population within the PSU will attenuate but not eliminate the unwanted variation in inclusion probabilities for the multistage sample of households and individuals (Cochran 1977). Following the selection of PSUs from the GIS grid overlay on the study area, the coordinates of the selected area units are loaded onto a GPS device that is used by the field staff member to locate the exact boundaries of a sampled PSU. This is where the two-dimensional world of most GIS mapping software and databases meets the irregular, three-dimensional reality of residential housing arrangements. Unlike conventional area probability samples where area sample units such as PSUs and area segments are defined according to distinct, recognizable geographic boundaries, the precise lines of the GPS-determined boundaries for selected PSUs, when projected onto the study area, will intersect streets and other boundaries at odd angles, and in urban areas, the boundaries are likely to bisect residential buildings (see Figure A.2 in the appendix to the Chen et al. paper). Field staff will need to be supplied with objective rules and carefully trained to make 1:1 assignments of dwellings to PSUs at this stage in order to avoid potential problems of selective coverage of dwellings on the PSU boundaries. 
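One possible form for an objective 1:1 assignment rule is sketched below in Python. It assumes, purely for illustration, that each dwelling is represented by a single reference point (for example, its main entrance) in projected coordinates measured in meters, and that grid cells are half-open so that a point lying exactly on a shared boundary belongs to one cell only. This rule, the function names, and the coordinates are hypothetical and are not part of the Chen et al. protocol.

```python
import math


def assign_to_cell(x, y, origin_x, origin_y, cell_size):
    """Return the (column, row) index of the grid cell containing (x, y).

    Cells are half-open intervals [lower, upper), so every point maps to
    exactly one cell, including points on a cell boundary.
    """
    col = math.floor((x - origin_x) / cell_size)
    row = math.floor((y - origin_y) / cell_size)
    return (col, row)


def dwellings_in_psu(dwellings, psu_cell, origin_x, origin_y, cell_size):
    """List the dwelling ids whose reference point falls in the sampled cell."""
    return [
        dwelling_id
        for dwelling_id, (x, y) in dwellings.items()
        if assign_to_cell(x, y, origin_x, origin_y, cell_size) == psu_cell
    ]


# Hypothetical reference points (meters) on a grid with 100 m cell sides.
dwellings = {
    "d1": (95.0, 40.0),
    "d2": (100.0, 40.0),   # exactly on the boundary between two cells
    "d3": (150.0, 99.9),
    "d4": (199.9, 100.0),
}
in_sample = dwellings_in_psu(dwellings, psu_cell=(1, 0),
                             origin_x=0.0, origin_y=0.0, cell_size=100.0)
```

A building bisected by a cell boundary is then linked to the single cell that contains its reference point, which avoids both double counting and omission at the PSU edges.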
This is clearly an area of needed methodological research and development as GIS/GPS-assisted area sampling strategies of the proposed type are employed in surveys of household populations.

SECOND, THIRD AND FINAL STAGES OF SAMPLING IN THE GIS/GPS-ASSISTED APPROACH

After the field staff have identified the boundaries of the PSU using their GPS device, they initiate the second stage of sample selection. An innovative but complicating feature of the authors’ GIS/GPS-assisted sampling plan for household populations lies in its use of a random walk method to sample and screen a cluster of eligible households and respondents (see the third stage in table 1). This random walk method of household enumeration within the sample PSU also defines the size and location of the second stage area segment from which households are sampled (see the second stage in table 1). As described in the paper, the random walk begins at “the main entrance of a street in urban areas or the beginning of a village in rural areas.” The field interviewer continues the random walk, screening and identifying households with eligible respondents until a fixed target of eligible households has been identified or an identifiable termination point (e.g., a new street intersection) is reached. At that point, the interviewer terminates their random walk enumeration.

As the authors note, this methodology has the advantage of creating balance in the interviewer workloads across PSUs: the random walk route is terminated when the desired number of eligible individuals has been identified. By allowing the physical size of the second stage area segment to expand until the target number of eligible households is identified, it also offsets some of the inequality in selection probabilities for the target population (and respondent weights) that is due to selection of PSUs without controlling for the associated size of the eligible population.
Note, however, by using the random walk and allowing the size of the enumerated area segment within each PSU to expand until a target number of eligible households has been identified, the probability of selection, f2, for the second stage area segment cannot be directly calculated—it must be estimated. The authors propose two methods for estimating f2, both of which require the field interviewer to use their GPS instrument to accurately outline the boundaries of the residential area that they covered in their random walk. This geo-positioning data will be used later by the central office to estimate the size of the actual residential area covered by the random walk route and the probability of selection for the area segment covered by the random walk within the PSU. Mathematically, the two ratio methods for estimating the area segment selection probability are reasonable. The critical variables in this process are the rules by which the local interviewer performs the local measurements of the area segment boundaries, which in turn lead to an estimated residential area, Ag, for the enumerated area segment. In my view, a major weakness of the proposed GIS/GPS-assisted sampling plan is the use of the random walk methodology to enumerate and select sample dwellings within sampled geo-units. Although the random walk is widely used in resource-limited sampling of household populations for epidemiological and other studies, published research by Bauer (2016) and earlier research summarized in Heeringa and Ziniel (2012) suggests that despite their attempt at “randomness,” random walk protocols do not achieve the objective of sampling households with known—let alone equal—probability. Determination of individual selection probabilities becomes even more ambiguous when the random walk is allowed to meander outside the fixed boundaries of the sample PSU (see Figure A.2 in the appendix to the Chen et al. paper). 
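To make the area measurement step concrete, the following Python sketch computes the area enclosed by a GPS-traced boundary using the standard shoelace formula. It assumes that the boundary vertices have already been converted to a projected coordinate system in meters and are listed in order around the perimeter; the function name and coordinates are hypothetical, and the authors’ two ratio methods for estimating f2 are not reproduced here.

```python
def polygon_area(vertices):
    """Area of a simple polygon from ordered (x, y) vertices (shoelace formula).

    Vertices are assumed to be in projected coordinates (meters) and to
    trace the boundary once without self-intersection.
    """
    n = len(vertices)
    if n < 3:
        raise ValueError("at least three vertices are required")
    twice_area = 0.0
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]  # wrap around to close the polygon
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


# Hypothetical boundary of a random walk area: a 40 m by 25 m rectangle.
walk_boundary = [(0.0, 0.0), (40.0, 0.0), (40.0, 25.0), (0.0, 25.0)]
area_g = polygon_area(walk_boundary)
```

Because the result depends entirely on which vertices the interviewer records, small differences in how the boundary is traced translate directly into differences in the estimated area and hence in the estimated selection probability.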
Although this GIS/GPS-assisted sampling plan is designed to be employed in resource-limited settings, there is an alternative to the random walk that requires only modest additional work by the field staff and which should eliminate much of the serious noncoverage and uncertainty over inclusion probabilities that can arise with the random walk method of enumeration. The alternative is the technique of “segmenting” a larger area sample unit in the field and randomly selecting segmented subparts of the area for the intensive tasks of dwelling enumeration, household sampling, and screening (see Kish 1965, Section 9.7D). This is a technique that is still widely used in area probability sampling in large rural areas and in urban areas where defined area sample units contain very large numbers of dwellings. Subparts of the original geounit with identifiable boundaries can be defined on location by the field staff and randomly ordered for household enumeration and screening. To achieve control over the cluster size of eligible households within each primary stage geo-unit, randomly ordered subparts can be introduced sequentially to the enumeration until the target cluster size for the geounit PSU is met or exceeded. Once a subpart is introduced to the second stage sample for the PSU, all dwellings in that subpart should be enumerated. If the enumeration produces a larger number of eligible households than desired, individual households can be subsampled prior to final household and respondent selection. In the Wuhan City study, the authors recommend the use of standard methods for completing a roster of eligible household members and randomly selecting one or more individuals and survey respondents. This methodology is consistent with the standard practice for conventional area probability surveys. 
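The segmenting alternative described above can be sketched as a short Python routine. The sketch assumes that the subparts of a geounit have been defined in the field and that the number of eligible households in a subpart becomes known once it is fully enumerated; here those counts are supplied up front only to simulate the process. All names and counts are hypothetical.

```python
import random


def select_subparts(eligible_counts, target, rng):
    """Introduce randomly ordered subparts until the target is met or exceeded.

    eligible_counts -- dict mapping subpart id to the number of eligible
                       households found when that subpart is fully enumerated
    Returns the subparts introduced and the total eligible households found.
    """
    order = list(eligible_counts)
    rng.shuffle(order)  # random order of introduction
    included, total = [], 0
    for subpart in order:
        if total >= target:
            break
        included.append(subpart)          # every dwelling in it is enumerated
        total += eligible_counts[subpart]
    return included, total


def subsample_households(households, target, rng):
    """Subsample at random if enumeration produced more than the target."""
    if len(households) <= target:
        return list(households)
    return rng.sample(households, target)


# Hypothetical geounit with four field-defined subparts.
counts = {"A": 6, "B": 9, "C": 4, "D": 7}
rng = random.Random(2018)
included, found = select_subparts(counts, target=12, rng=rng)
households = [f"hh{i}" for i in range(found)]
final_sample = subsample_households(households, target=12, rng=rng)
```

Because subparts have fixed, identifiable boundaries and are introduced in a documented random order, the within-PSU selection steps remain traceable in a way that a free-form random walk does not allow.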
SURVEY WEIGHTING AND ESTIMATION UNDER THE GIS/GPS-ASSISTED APPROACH

The authors propose the following expression for calculating individual sample selection weights for respondents selected using the multistage GIS/GPS-assisted design. The individual weights calculated using this formula may in turn be used to compute design-based, weighted estimates of descriptive statistics for the target population.

$$\hat{W}_i = W_g \cdot W_{gh} \cdot W_{ghi} = \frac{\hat{R}}{a \cdot A_g} \cdot \frac{T_g}{H_g} \cdot \frac{N_{gh}}{n_{gh}}$$

where:
$W_g$ = the combined weight factor for sampling of PSU $\alpha$ and area segment $g$;
$W_{gh}$ = the weight factor for sampling household $h$ from area segment $g$;
$W_{ghi}$ = the weight factor for sampling respondent $i$ in household $h$;
$\hat{R}$ = the Method 1 or 2 estimate of total residential area in the survey population;
$a$ = total number of primary stage selections, $\alpha = 1, \ldots, a$;
$A_g$ = field staff measure of random walk area in area segment $g$ and PSU $\alpha$;
$T_g$ = total number of enumerated households in area segment $g$;
$H_g$ = total number of sampled households in area segment $g$;
$N_{gh}$ = the total number of eligible individuals in household $h$;
$n_{gh}$ = the total number of randomly sampled individuals in household $h$.

A “^” is placed over the $W_i$ since the expression for this weight includes a quantity, $\hat{R}$, that must be estimated from field observations that determine the value of $A_g$. As noted above, the total error in the assigned weights will depend in large part on the consistency and accuracy of field staffs’ GPS coordinate measures that are used to determine the individual values of $A_g$. Assuming probability sampling and accurate determination of the sample inclusion probabilities and corresponding weights, the multistage design for the authors’ GIS/GPS-assisted sampling method is fully compatible with the available alternatives for robust estimation of sampling variances for complex samples (Heeringa, West, and Berglund 2017).
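The weight expression above translates directly into a small Python function, shown here as a sketch with the three factors kept separate so that each stage can be inspected. The parameter names mirror the symbols in the formula; the input values are hypothetical and serve only to show the arithmetic.

```python
def respondent_weight(r_hat, a, area_g, t_g, h_g, n_gh_eligible, n_gh_sampled):
    """Selection weight as the product of the three stage factors.

    r_hat         -- estimated total residential area in the survey population
    a             -- number of primary stage selections
    area_g        -- measured random walk area of area segment g
    t_g, h_g      -- enumerated and sampled households in area segment g
    n_gh_eligible -- eligible individuals in household h
    n_gh_sampled  -- individuals sampled from household h
    """
    w_g = r_hat / (a * area_g)            # PSU and area segment factor
    w_gh = t_g / h_g                      # household factor
    w_ghi = n_gh_eligible / n_gh_sampled  # respondent factor
    return w_g * w_gh * w_ghi


# Hypothetical values: areas in square meters, counts of households/persons.
w = respondent_weight(r_hat=2_000_000.0, a=40, area_g=1000.0,
                      t_g=30, h_g=20, n_gh_eligible=3, n_gh_sampled=1)
```

With these inputs the three factors are 50, 1.5, and 3. Because the area segment measure sits in the denominator of the first factor, any error in the field measurement of the random walk area carries through proportionally to the final weight.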
SUMMARY

In concluding, I wish to again congratulate the authors on their innovative application of GIS/GPS to area probability sampling of household target populations when time constraints or physical resources and conditions limit the ability to employ conventional area probability sampling methods. My hope is that the critique that I have presented in my discussion will stimulate additional methodological research on their GIS/GPS-assisted area probability sample designs, focused on the real constraints that exist in resource-limited settings but with an eye to opportunities that the GIS/GPS and Google Earth technology provide to construct designs that are optimal under those constraints.

References

Bauer J. (2016), “Biases in Random Route Surveys”, Journal of Survey Statistics and Methodology, 4, 263–287.
Chang A. Y., Parrales M. E., Jimenez J., Sobieszczyk M. E., Hammer S. M., Copenhaver D. J., Kulkarni R. P. (2009), “Combining Google Earth and GIS Mapping Technologies in a Dengue Surveillance System for Developing Countries”, International Journal of Health Geographics, 8, 49.
Cochran W. G. (1977), Sampling Techniques (3rd ed.), New York: John Wiley and Sons.
Elwood P. C. (1982), “Randomized Controlled Trials: Sampling”, British Journal of Clinical Pharmacy, 13, 631–636.
Escamilla V., Emch M., Dandalo L., Miller W. C., Martinson F., Hoffman I. (2014), “Sampling at community level using satellite imagery and geographical analysis”, Bulletin of the World Health Organization, 92, 690–694.
Heeringa S. G., Connor J. H., Haeussler J. S., Redmond G. B., Samonte J. E. (1994), “1990 SRC National Sample: Design and Development”, Survey Research Center Technical Report, Institute for Social Research, University of Michigan, Ann Arbor, MI.
Heeringa S., O’Muircheartaigh C. (2010), “Sampling Designs for Cross-Cultural and Cross-National Survey Programs”, in Survey Methods in Multinational, Multiregional and Multicultural Contexts, ed. Harkness, et al., pp. 251–268, New York: John Wiley and Sons.
Heeringa S., Ziniel S. (2012), Sample Design and Procedures for Hepatitis B Immunization Surveys: A Companion to the WHO Cluster Survey Manual. WHO/ivb/11.12.
Heeringa S. G., West B. T., Berglund P. A. (2017), Applied Survey Data Analysis, Second Edition, London: Chapman and Hall.
Keiding N., Louis T. A. (2016), “Perils and Potentials of Self-Selected Entry to Epidemiological Studies and Surveys”, Journal of the Royal Statistical Society, Series A, 179, 319–376.
Kish L. (1965), Survey Sampling, New York: John Wiley and Sons.
Kessler R. C., Haro J. M., Heeringa S. G., Pennell B.-E., Bedirhan Üstün T. (2006), “The World Health Organization World Mental Health Survey Initiative”, Epidemiologia e Psichiatria Sociale, 15, 161–166.
Laing T. J., Schottenfeld D., Lacey J. V. Jr, Gillespie B. W., Garabrant D. H., Cooper B. C., Heeringa S. G., Alcser K. H., Mayes M. D. (2001), “Potential Risk Factors for Undifferentiated Connective Tissue Disease Among Women: Implanted Medical Devices”, American Journal of Epidemiology, 154, 610–617.
Langa K. M., Plassman B. L., Wallace R. B., Herzog A. R., Heeringa S. G., Ofstedal M. B., Burke J. R., Fisher G. G., Fultz N. H., Hurd M. D., Potter G. G., Rodgers W. L., Steffens D. C., Weir D. R., Willis R. J. (2005), “The Aging, Demographics and Memory Study: Study Design and Methods”, Neuroepidemiology, 25, 181–191.
Moussavi S., Chatterji S., Verdes E., Tandon A., Patel V., Ustun B. (2007), “Depression, Chronic Disease and Decrements in Health: Results from the World Health Surveys”, The Lancet, 370, 851–858.
Pietrzak R. H., Tracy M., Galea S., Kilpatrick D. G., Ruggiero K. J., Hamblen J. L., Southwick S. M., Norris F. H. (2012), “Resilience in the Face of Disaster: Prevalence and Longitudinal Course of Mental Disorders Following Hurricane Ike”, PLoS One, 7, e38964.
Singh G., Clark B. D. (2013), “Creating a Frame: A Spatial Approach to Random Sampling of Immigrant Households in Inner City Johannesburg”, Journal of Refugee Studies, 26, 126–144.
Thompson S. K. (1992), Sampling, New York: John Wiley and Sons.
Tourangeau R., Edwards B., Johnson T. J., Wolter K. M., Bates N., eds. (2014), Hard-to-Survey Populations, Cambridge, UK: Cambridge University Press.
Wampler P. J., Rediske R. R., Molla A. R. (2013), “Using Arcmap, Google Earth, and Global Position Systems to Select and Locate Random Households in Rural Haiti”, International Journal of Health Geographics, 12, 3, online, https://doi.org/10.1186/1476-072X-12-3, last accessed March 1, 2018.

© The Author(s) 2018. Published by Oxford University Press on behalf of the American Association for Public Opinion Research. All rights reserved. For permissions, please email: journals.permissions@oup.com This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices)

Journal

Journal of Survey Statistics and Methodology, Oxford University Press

Published: Apr 18, 2018
