An Unbalanced Ranked-Set Sampling Method to Get More Than One Sample from Each Set

Abstract

In ranked-set sampling, the restriction of selecting just one individual from each set may require too many sets. We propose a new version of ranked-set sampling that relaxes this restriction. Our new design uses stratified sampling in which ranked-set sampling is used to form the strata. Simulations, and a real case study on medicinal flowers, show that this design can be more precise and less costly than previous designs.

1. INTRODUCTION

Ranked-set sampling (RSS) was first introduced by McIntyre (1952) and has been a widely used design in many applications. It is particularly appealing in agricultural and environmental sciences, where identifying sampling units in the field is straightforward, but exact measurement of the units is time consuming. The RSS technique takes several random samples of size $m$ from the population. The sample units are ranked by some quick and easy measure. Then, one unit from each sample is chosen and precisely measured for the character of interest. Consider $m$ samples of size $m$. The unit that has the lowest rank among the $m$ units in the first sample is chosen, the unit with the second lowest rank is chosen from the second sample, and so on until the unit with the highest rank is chosen from the $m$th sample. This process is repeated $n$ times, giving a final sample size of $nm$.

This balanced RSS method can be generalized to unbalanced designs, where the number of sample units selected at each of the ranks is not constant. With highly skewed population distributions, more units from low (or high) ranks can be selected. Unbalanced designs are similar in concept to optimal allocation in stratified sampling where strata with larger variances take larger sample fractions. RSS is reported as being more efficient than simple random sampling (SRS) (Samawi 1996; Ridout 2003).
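The balanced RSS procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `balanced_rss`, the finite list `population`, and ranking on the value itself (perfect ranking) are choices made here and are not part of the original paper.

```python
import random

def balanced_rss(population, m, n, rank_key=None, rng=random):
    """Draw a balanced ranked-set sample of n * m measured units.

    In each of the n cycles, m sets of m units are drawn; the set with
    index r contributes its unit of rank r (r = 0 is the lowest rank).
    """
    # Assumption: if no ranking rule is given, rank on the value itself.
    rank_key = rank_key or (lambda unit: unit)
    sample = []
    for _ in range(n):                      # n cycles
        for r in range(m):                  # one set per rank
            ranked_set = sorted(rng.sample(population, m), key=rank_key)
            sample.append(ranked_set[r])    # measure only the rank-r unit
    return sample

rng = random.Random(1)
population = [rng.lognormvariate(0, 1) for _ in range(10_000)]
sample = balanced_rss(population, m=3, n=4, rng=rng)
print(len(sample))  # 12 measured units, obtained from 12 ranked sets
```

Note that 36 units are ranked here in order to measure 12 of them, which is the cost that the designs discussed later try to reduce.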
See full reviews of RSS by Patil, Sinha, and Taillie (1999) and the related book of Chen, Bai, and Sinha (2004).

In RSS designs, just one sample unit from each ranked set is selected. Ranking can be based on some auxiliary, easy-to-measure variables, but the cost involved in sampling and ranking cannot be completely ignored. Wang, Chen, and Liu (2004) proposed a design to obtain more than one unit from each ranked set and showed that the design can be more efficient than RSS. We call their design “L-Tuple RSS.” Ghosh and Tiwari (2008, 2009) considered L-Tuple RSS for estimating distribution functions and presented a general form of design that unifies different versions of RSS.

For unbalanced RSS, McIntyre (1952) proposed that the sample size corresponding to each rank order should be proportional to its standard deviation; i.e., an unequal allocation, as realized in the Neyman approach. In this context, Kaur, Patil, and Taillie (1997) note that the variance of order statistics typically increases with the rank orders for positively skewed distributions on $(0,+\infty)$. This work is consistent with Neyman’s optimal allocation and proposes two models for unequal allocation for skew distributions.

As in the approach of Wang et al. (2004), we introduce a design that relaxes the restriction to one sample unit from each set and the restriction with respect to the number of needed sets. Relaxing these two restrictions leads to fewer sets and a more flexible design. We introduce the new method, Virtual Stratified Sampling using RSS, in section 2. In section 3, we define two criteria based on cost and precision to evaluate the design. Section 4 contains simulations and a real case study to compare the methods. Section 5 presents conclusions.

2. MORE THAN ONE SAMPLE UNIT FROM EACH SET

2.1 L-Tuple RSS (LTR)

Wang et al. (2004) suggested forming $L_{m,t} = \binom{m}{t}$ ranked sets of size $m$, and then selecting a sample of $t$ units from each set, identified by mutually different ranks.
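As a rough illustration of the LTR bookkeeping, the following Python sketch enumerates the rank tuples of one cycle and counts the sets and measured units for $d$ cycles. The function names `ltr_rank_tuples` and `ltr_sizes` are hypothetical, and the lexicographic order of the tuples is an assumption made here for convenience.

```python
from itertools import combinations
from math import comb

def ltr_rank_tuples(m, t):
    """All rank tuples (1-based) measured in one LTR cycle, one tuple per set."""
    return list(combinations(range(1, m + 1), t))

def ltr_sizes(m, t, d):
    """Return (number of sets, number of measured units) for d cycles."""
    n_sets = d * comb(m, t)      # d * L_{m,t} ranked sets
    return n_sets, t * n_sets    # t units are measured in every set

print(ltr_rank_tuples(4, 2)[:2])                    # [(1, 2), (1, 3)]
print([ltr_sizes(6, 3, d) for d in (1, 2, 3, 4)])   # [(20, 60), (40, 120), (60, 180), (80, 240)]
```

The second line of output shows how coarse the attainable sample sizes become for the $m = 6$, $t = 3$ case used in this section.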
For example, if $t = 2$, select the units with ranks one and two from the first set and units with ranks one and three from the second set. This process is repeated until each pair of ranks among the $L_{m,2}$ possible pairs is selected. This design needs fewer ranked-set units than RSS, and Wang et al. (2004) showed that the design is more cost-efficient than RSS. However, the inflexibility of the design in attaining a desired sample size can be a weakness. Given the main parameters of the design ($m$ and $t$) and the number of cycles ($d$), implementation requires that we obtain $d \times L_{m,t}$ sets with corresponding sample size given by $d \times t \times L_{m,t}$, which may impose large gaps between the possible sample sizes. For example, with $m = 6$ and $t = 3$, $L_{m,t} = 20$, we can implement the design with 20, 40, 60, 80, etc., sets and sample sizes of 60, 120, 180, 240, etc., corresponding to $d = 1, 2, 3, 4, \ldots$.

2.2 Virtual Stratified Sampling Using Ranked Set Sampling (VSR)

We introduce a new strategy in which two, three, or all elements of a set can be selected in the final sample without any restriction to have a balanced form. Suppose that $X$ is the variable of interest with $E(X) = \mu$ and $V(X) = \sigma^2 < \infty$. The goal is to estimate $\mu$ and the variance of the estimator. Suppose further that $Y$ is an auxiliary variable (used for ranking) with $E(Y) = \mu_y$ and $V(Y) = \sigma_y^2 < \infty$, and that $Y$ has reasonable correlation with the main variable $X$. When it is possible to rank on $X$ directly, we simply take $Y = X$ in this formulation.

For $i = 1, 2, \ldots, K$, we select the $i$th set of $m$ units from the population and assume that the corresponding $(X, Y)$ values are independent and identically distributed from their bivariate distribution. We order the $Y$ values for the $m$ sampling units in the $i$th set as $Y_{[1]i} \le Y_{[2]i} \le \cdots \le Y_{[m]i}$ and denote the corresponding $X$ values ordered on $Y$ as $X_{[1]i}, X_{[2]i}, \ldots, X_{[m]i}$. By conditioning on these $X$ and $Y$ values, we have a finite population with $m$ strata, each of size $K$, as depicted in table 1.
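The construction of the virtual strata just described can be sketched as follows. This is a minimal illustration, not the authors' code: `virtual_strata` and `draw_unit` are hypothetical names, and the bivariate normal generator with correlation 0.7 is an assumption chosen only to have something to draw from.

```python
import random

def virtual_strata(draw_unit, m, K):
    """Form K sets of m (x, y) units, rank each set on y, and return m strata.

    strata[h][i] is the x value of the unit with y-rank h + 1 in set i + 1,
    that is, X_[h+1](i+1) in the notation of the text.
    """
    strata = [[] for _ in range(m)]
    for _ in range(K):
        units = [draw_unit() for _ in range(m)]   # one set of m units
        units.sort(key=lambda unit: unit[1])      # rank on the auxiliary Y
        for h, (x, _y) in enumerate(units):
            strata[h].append(x)
    return strata

rng = random.Random(7)

def draw_unit():
    # Assumption: standard bivariate normal (X, Y) with correlation 0.7.
    y = rng.gauss(0, 1)
    x = 0.7 * y + rng.gauss(0, 1) * (1 - 0.7 ** 2) ** 0.5
    return x, y

strata = virtual_strata(draw_unit, m=5, K=8)
print(len(strata), len(strata[0]))  # 5 strata, each holding 8 X values
```

In practice the X values would not be known at this stage; only the units selected from each stratum are measured. The sketch stores X for every unit purely so that the strata can be inspected.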
From the $K$ units in stratum $h$ of this stratified population, we draw a sample $s_h$ of size $n_h$ via simple random sampling without replacement (SRSWOR), where $1 \le n_h \le K$ and $\sum_{h=1}^{m} n_h = n_{\bullet}$. We obtain the measurements of interest $X_{[h]i}$ for the units in $s_h$.

Table 1. Virtual Stratified Population Based on the Auxiliary Variable

             1st stratum    2nd stratum    ⋯    mth stratum
1st set      $X_{[1]1}$     $X_{[2]1}$     ⋯    $X_{[m]1}$
2nd set      $X_{[1]2}$     $X_{[2]2}$     ⋯    $X_{[m]2}$
⋮            ⋮              ⋮              ⋱    ⋮
Kth set      $X_{[1]K}$     $X_{[2]K}$     ⋯    $X_{[m]K}$

The proposed estimator is then

$$\hat{\mu}_{VSR} = \frac{1}{m}\sum_{h=1}^{m}\bar{X}_{[h]}, \qquad \text{where} \quad \bar{X}_{[h]} = \frac{1}{n_h}\sum_{i \in s_h} X_{[h]i}.$$

The link of virtual stratified sampling (VSR) to finite population sampling makes analytic evaluation of VSR and its variants straightforward, as shown below. This analytic tractability is an advantage of VSR over rival designs.

For a sample of size $m$, define $\mu_h = E(X \mid \mathrm{rank}(Y) = h)$ and $\sigma_h^2 = V(X \mid \mathrm{rank}(Y) = h)$. The following theorem provides a simple estimate of the variance of $\hat{\mu}_{VSR}$.

Theorem 1. In VSR, $\hat{\mu}_{VSR}$ is an unbiased estimator for $\mu$ with

$$\mathrm{Var}(\hat{\mu}_{VSR}) = \frac{\sigma^2}{Km} + \frac{1}{m^2}\sum_{h=1}^{m}\frac{1 - n_h/K}{n_h}\,\sigma_h^2 \qquad (1)$$

and an unbiased estimator for $\mathrm{Var}(\hat{\mu}_{VSR})$ is

$$\widehat{\mathrm{Var}}(\hat{\mu}_{VSR}) = \frac{K-1}{m(mK-1)}\sum_{h=1}^{m}\frac{1}{n_h(n_h-1)}\sum_{i \in s_h}\left(X_{[h]i} - \bar{X}_{[h]}\right)^2 + \frac{1}{m(mK-1)}\sum_{h=1}^{m}\left(\bar{X}_{[h]} - \hat{\mu}_{VSR}\right)^2.$$

For the proof, see appendix A.

If the distribution of the population is skewed, we can allocate $n_h$ to oversample strata with large variances and gain additional reduction in the estimator variance; see §2.3 below. Alternatively, we can sample with equal allocation across strata.

Corollary 2. Under VSR with equal allocation, i.e., $n_h = n$; $h = 1, 2, \ldots, m$, we have

$$\mathrm{Var}(\hat{\mu}_{VSR}) = \frac{1}{nm}\left[\sigma^2 - \left(1 - \frac{n}{K}\right)\frac{1}{m}\sum_{h=1}^{m}(\mu_h - \mu)^2\right]. \qquad (2)$$

If we assume $X$ and $Y$ are sampled from the linear regression model

$$X = \mu + \rho_{xy}\frac{\sigma}{\sigma_y}(Y - \mu_y) + \varepsilon \qquad (3)$$

where $\rho_{xy}$ and $\varepsilon$ are the Pearson correlation and a random variable independent from $X$ respectively, then

$$\mathrm{Var}(\hat{\mu}_{VSR}) = \frac{1}{nm}\left[\sigma^2 - \rho_{xy}^2\left(1 - \frac{n}{K}\right)\frac{1}{m}\sum_{h=1}^{m}(\mu_{(h)} - \mu)^2\right].$$

For the proof, see appendix B.

It is possible to rewrite (2) as

$$\mathrm{Var}(\hat{\mu}_{VSR}) = \underbrace{\frac{\sigma^2}{nm}}_{A} - \underbrace{\frac{1}{nm^2}\sum_{h=1}^{m}(\mu_h - \mu)^2}_{B} + \underbrace{\frac{1}{Km^2}\sum_{h=1}^{m}(\mu_h - \mu)^2}_{C}$$

and then we have

A: the variance of the sample mean in SRS ($\mathrm{Var}_{SRS}$)
A − B: the variance of the sample mean in RSS ($\mathrm{Var}_{RSS}$)
A − B + C: the variance of the sample mean in VSR ($\mathrm{Var}_{VSR}$)

Now as $A$, $B$, and $C$ are nonnegative, $C \le B$ (because $n \le K$), and also $\lim_{K \to \infty} C = 0$, it is easy to conclude the following:

(i) $\mathrm{Var}_{RSS} \le \mathrm{Var}_{SRS}$ and $\mathrm{Var}_{VSR} \le \mathrm{Var}_{SRS}$.
(ii) $\mathrm{Var}_{RSS} \le \mathrm{Var}_{VSR}$. But the simulations show that, with cost considerations, VSR can be more efficient than RSS.
(iii) $\lim_{K \to \infty} \mathrm{Var}_{VSR} = \mathrm{Var}_{RSS}$.

It is notable that taking more than one sample from each set is useful when the cost of sampling or ranking is too high.

2.3 VSR with Neyman Allocation (NVSR)

To enhance efficiency, we can use Neyman allocation, provided information about the variances of the strata is available. Different measurement costs in different strata can also modify the optimal allocation (Sarndal, Swensson, and Wretman 1992, p. 104). For example, it might be more costly to get precise measurements of the sampled elements in the stratum of the largest order statistic than in the stratum of the smallest order statistic. Here, we consider Neyman allocation when the cost is equal in all strata. NVSR is a special case of VSR with

$$n_h = n_{\bullet}\,\frac{\sigma_h}{\sum_{h=1}^{m}\sigma_h}. \qquad (4)$$

We denote the NVSR mean estimator by $\hat{\mu}_{NVSR}$. Theorem 3 shows that there is a gain from combining VSR with the Neyman allocation procedure.

Theorem 3. In VSR with equal allocation strategy, we have the following:

$$\frac{\mathrm{Var}(\hat{\mu}_{VSR})}{\mathrm{Var}(\hat{\mu}_{NVSR})} \ge 1.$$

For the proof, see appendix C. Henceforth, VSR stands for VSR with equal allocation.

3. COMPARING VSR WITH LTR

3.1 Variance of the Designs

Based on Wang et al.
(2004), for the unbiased estimator of $\mu$ in LTR with $d$ cycles, we have

$$\mathrm{Var}(\hat{\mu}_{LTR}) = \frac{1}{d\,t\,L_{m,t}}\left[\sigma^2 - \frac{m-t}{m-1}\,\frac{1}{m}\sum_{h=1}^{m}(\mu_h - \mu)^2\right].$$

Then if we define the efficiency of VSR relative to LTR as

$$\mathrm{efficiency}(\hat{\mu}_{VSR}, \hat{\mu}_{LTR}) = \frac{\mathrm{Var}(\hat{\mu}_{LTR})}{\mathrm{Var}(\hat{\mu}_{VSR})}$$

we have

$$\mathrm{efficiency}(\hat{\mu}_{VSR}, \hat{\mu}_{LTR}) = \frac{\dfrac{1}{d\,t\,L_{m,t}}\left[\sigma^2 - \dfrac{m-t}{m-1}\,\dfrac{1}{m}\sum_{h=1}^{m}(\mu_h - \mu)^2\right]}{\dfrac{1}{nm}\left[\sigma^2 - \left(1 - \dfrac{n}{K}\right)\dfrac{1}{m}\sum_{h=1}^{m}(\mu_h - \mu)^2\right]}.$$

If we define $n_{VSR}$ and $n_{LTR}$ as the total sample sizes to measure in VSR and LTR respectively, and $K_{VSR}$ and $K_{LTR}$ as the total numbers of sets used in VSR and LTR respectively, we have

$$n_{VSR} = nm, \qquad n_{LTR} = d\,t\,L_{m,t}, \qquad K_{VSR} = K, \qquad K_{LTR} = d\,L_{m,t}.$$

Now if we set $n_{VSR} = n_{LTR}$ and $K_{VSR} = K_{LTR}$, it is easy to show that $\mathrm{efficiency}(\hat{\mu}_{VSR}, \hat{\mu}_{LTR}) \le 1$. In other words, VSR cannot be more efficient than LTR with the same sample size and same number of sets. This happens because LTR separates the final sample units in the sets that have the least covariance between selected units. This is one of the advantages of LTR relative to VSR. But as a disadvantage, LTR is not flexible, meaning that depending on the parameters, either $n_{LTR}$ or $K_{LTR}$ is imposed in the design. On the other hand, VSR is very flexible in setting $n_{VSR}$ and $K_{VSR}$. We believe VSR can be more efficient than LTR under two scenarios:

(1) Using its flexibility, we implement NVSR based on some auxiliary variable information about the strata.
(2) Based on cost considerations, with $n_{VSR} = n_{LTR}$ we set $K_{VSR}$ less than $K_{LTR}$.

Scenario one uses the stratified aspect of VSR. After the first-stage sample (of size $mK$), it is easy for a researcher to implement stratified sampling with Neyman allocation. A potential disadvantage of this scenario, if based on the variability of the auxiliary variable, is that the allocation cannot be completed, the final sample cannot be determined, and the final measurements cannot be obtained until all initial rankings and measurements are completed. This means retaining all of the sets until the final measurements are completed, which might be impractical in some applications (e.g., with live animals or time-sensitive materials).
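Scenario one can be sketched in Python as follows: allocate the measurements with (4), draw an SRSWOR from each virtual stratum, and compute $\hat{\mu}_{VSR}$ together with the variance estimator of Theorem 1. The function names, the rounding rule, and the lower bound of two units per stratum (needed because the variance estimator divides by $n_h - 1$) are assumptions of this sketch, not prescriptions from the paper; with rounding, the allocated total may differ slightly from $n_{\bullet}$.

```python
import random
from statistics import mean, pstdev, variance

def neyman_allocation(stratum_sds, n_total, K, n_min=2):
    """Equation (4), rounded and clipped to [n_min, K] (rounding is an assumption)."""
    total_sd = sum(stratum_sds)
    return [min(K, max(n_min, round(n_total * sd / total_sd)))
            for sd in stratum_sds]

def vsr_estimate(x_strata, allocation, rng=random):
    """x_strata[h] holds the K values of stratum h; returns (mu_hat, var_hat)."""
    m, K = len(x_strata), len(x_strata[0])
    # SRSWOR of size n_h within every virtual stratum.
    samples = [rng.sample(stratum, n_h)
               for stratum, n_h in zip(x_strata, allocation)]
    stratum_means = [mean(s) for s in samples]
    mu_hat = mean(stratum_means)
    # variance(s) / n_h equals sum (x - xbar)^2 / (n_h * (n_h - 1)).
    within = sum(variance(s) / len(s) for s in samples)
    between = sum((xbar - mu_hat) ** 2 for xbar in stratum_means)
    var_hat = ((K - 1) * within + between) / (m * (m * K - 1))
    return mu_hat, var_hat

# Hypothetical strata: X values and the auxiliary Y values used for ranking.
x_strata = [[1.0, 1.2, 0.9, 1.1], [2.0, 2.6, 1.7, 2.3], [3.5, 5.0, 2.9, 6.1]]
y_strata = [[0.8, 1.0, 0.9, 1.1], [1.9, 2.4, 1.6, 2.2], [3.0, 4.8, 2.5, 5.9]]
sds = [pstdev(ys) for ys in y_strata]   # variability of the auxiliary variable
allocation = neyman_allocation(sds, n_total=8, K=4)
mu_hat, var_hat = vsr_estimate(x_strata, allocation, rng=random.Random(3))
```

Using the spread of the auxiliary variable in place of the unknown $\sigma_h$ mirrors the description of scenario one, but it is only a proxy; how well it approximates the Neyman allocation depends on the correlation between $X$ and $Y$.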
Scenario two is the most important advantage of VSR. This scenario lets us exploit the efficiency of RSS without its strict sampling restrictions.

3.2 Cost Considerations

We define the cost of the designs as

$$\mathrm{Cost}_{LTR} = d\,m\,L_{m,t}\,C_e + d\,m\,L_{m,t}\,C_y + d\,t\,L_{m,t}\,C_x \quad \text{for LTR},$$
$$\mathrm{Cost}_{VSR} = m K C_e + m K C_y + m n C_x \quad \text{for VSR},$$

where $C_e$ is the cost of sampling one unit, and $C_x$ and $C_y$ are the costs of measuring the main variable and the auxiliary variable of one unit, respectively. Now we can define efficiency based on cost as

$$\mathrm{CoEfficiency}(\hat{\mu}_{VSR}, \hat{\mu}_{LTR}) = \frac{\mathrm{Var}(\hat{\mu}_{LTR})}{\mathrm{Var}(\hat{\mu}_{VSR})} \times \frac{\mathrm{Cost}_{LTR}}{\mathrm{Cost}_{VSR}}. \qquad (5)$$

Criterion (5) accounts for precision and cost at the same time. Note that if we rank visually, we should replace the part of the cost corresponding to measurement of the auxiliary variable with the cost due to ranking. Since the expected number of pairwise comparisons needed for ranking $m$ units is $(m+2)(m-1)/4$ (Wang et al. 2004), we propose the following two alternative costs:

$$\mathrm{Cost}^{*}_{LTR} = d\,m\,L_{m,t}\,C_e + d\,\frac{(m+2)(m-1)}{4}\,L_{m,t}\,C_r + d\,t\,L_{m,t}\,C_x \quad \text{for LTR},$$
$$\mathrm{Cost}^{*}_{VSR} = m K C_e + \frac{(m+2)(m-1)}{4}\,K C_r + m n C_x \quad \text{for VSR},$$

where $C_r$ is the cost of each pairwise comparison.

4. SIMULATION STUDIES

In this section, we compare VSR and LTR using simulated populations with two scenarios and four distributions, and a real case study on medicinal flowers. In the simulations, at least one sample is selected from each stratum, regardless of its size or variance. All simulations were run in R 3.1.2. We used a Monte Carlo approach and simulated 20,000 samples for each design option.

4.1 Comparing VSR with LTR Using Some Simulations on Four Distributions

We considered four distributions: normal, lognormal, chi-square, and exponential. We performed the simulations with different parameters. For all of them, we set $m = 5$, $K_{VSR} = K_{LTR}$, and $n_{VSR} = n_{LTR}$ so that the costs of VSR and LTR were the same.
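To make the interplay of the variance formulas of section 3.1 and the cost formulas of section 3.2 concrete, here is a small Python sketch of criterion (5). All function names are hypothetical, and the numerical inputs ($\sigma^2 = 1$, a between-rank term of 0.4, and the unit costs) are illustrative assumptions rather than values from the paper.

```python
from math import comb

def var_vsr(sigma2, between, n, m, K):
    """Equation (2); between = (1/m) * sum_h (mu_h - mu)^2."""
    return (sigma2 - (1 - n / K) * between) / (n * m)

def var_ltr(sigma2, between, d, m, t):
    """Variance of the LTR estimator with d cycles (Wang et al. 2004)."""
    return (sigma2 - (m - t) / (m - 1) * between) / (d * t * comb(m, t))

def cost_vsr(m, K, n, c_e, c_y, c_x):
    return m * K * (c_e + c_y) + m * n * c_x

def cost_ltr(m, t, d, c_e, c_y, c_x):
    n_sets = d * comb(m, t)
    return m * n_sets * (c_e + c_y) + t * n_sets * c_x

def coefficiency(sigma2, between, m, t, d, n, K, c_e, c_y, c_x):
    """Criterion (5): variance ratio times cost ratio."""
    variance_ratio = var_ltr(sigma2, between, d, m, t) / var_vsr(sigma2, between, n, m, K)
    cost_ratio = cost_ltr(m, t, d, c_e, c_y, c_x) / cost_vsr(m, K, n, c_e, c_y, c_x)
    return variance_ratio * cost_ratio

# Illustrative inputs: m = 5, t = 2, d = 1 gives K_LTR = 10 and n_LTR = 20,
# so VSR with n = 4 per stratum has the same number of measured units.
common = dict(sigma2=1.0, between=0.4, m=5, t=2, d=1, n=4, c_e=1.0, c_y=1.0, c_x=5.0)
same_sets = coefficiency(K=10, **common)   # K_VSR = K_LTR: cannot exceed 1
fewer_sets = coefficiency(K=5, **common)   # scenario two: K_VSR < K_LTR
print(same_sets <= 1, fewer_sets > 1)      # True True
```

With equal numbers of sets the two costs coincide, so (5) reduces to the plain efficiency, which cannot exceed one. Halving the number of sets raises the VSR variance but lowers its cost by more in this particular configuration; other cost ratios can reverse that conclusion.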
For sorting observations in the sets to execute RSS and for the variance information inside the strata needed in Neyman allocation, we used an auxiliary variable for each distribution, with approximately 0.70 correlation with the survey variable. Results are shown in figure 1. As we can see, in the case of positively skewed distributions (like lognormal, chi-square, and exponential), NVSR is better than LTR. As the distribution approaches a symmetric distribution or a distribution with light tails (as with chi-square and normal), the efficiency of NVSR reduces. Also, in almost all the cases, the efficiency of NVSR relative to LTR reduces with decreasing $t$. Consideration of efficiency then suggests the use of NVSR rather than LTR for asymmetric distributions. We also note the increase in efficiency in the case of lognormal, resulting from a heavy right tail.

Figure 1. Efficiency of Neyman for the 4 Distributions with Parameter v, that Is Changing in Horizontal Axis with Different t with m = 5.

For scenario two, for exp(2) and lognormal(0,2) we set $n_{VSR} = n_{LTR}$, $m = 5$, with $t = 1$ (figure 2) and $t = 2$ (figure 3). The costs were based on $\mathrm{Cost}_{LTR}$ and $\mathrm{Cost}_{VSR}$. Given the three parameters $n_{LTR}$, $t$, and $m$, the other parameters $d$ and $K_{LTR}$ were imposed in LTR. However, VSR is flexible enough to allow selection of $K_{VSR}$ freely. Also, we let the parameter $(C_y + C_e)/C_x$ vary from 0.1 to 20. In all the cases, we set $C_x = 5C_y$. In figures 2 and 3, the left plots show VSR (scenario two), and the right plots show NVSR (scenarios one and two together). As we can see in the figures, by increasing the cost of sampling ($C_y + C_e$) relative to the cost of measurement ($C_x$), CoEfficiency (5) increases. Also, smaller $K_{VSR}$ is more CoEfficient (i.e., CoEfficiency $\ge 1$) than larger $K_{VSR}$.
CoEfficiency is larger for $t = 1$ relative to $t = 2$, because the cost of the former (which needs more sets) is more than that of the latter in LTR. Combination of the two scenarios (right plots) shows surprising CoEfficiency that rises as high as 10 for some cases for lognormal.

Figure 2. CoEfficiency of NVSR and VSR for Exp(2) and Lognormal(0,2) with m = 5 and t = 1.

Figure 3. CoEfficiency of NVSR and VSR for Exp(2) and Lognormal(0,2) with m = 5 and t = 2.

4.2 Comparing VSR with LTR Using a Real Case Study on Some Medicinal Flowers

Matricaria chamomilla, commonly known as chamomile, is an annual plant of the Asteraceae family native to Iran and European countries. Chamomile is considered a very important commercial and medicinal plant. The plant flowers, usually dried, are used in herbal medicine for a sore stomach, irritable bowel syndrome, and as a gentle sleep aid. Chamomile taken as an herbal dried tea is also used as a mild laxative and has anti-inflammatory and bactericidal effects. The health benefits of chamomile are due to the flower’s essential oil content, which could be 0.24 to 2.0 percent of flower dry weight (Gardiner 1999). As a result, achieving the maximum amount of dried flower is critical to maximize essential oil yield. To reach this goal, plant breeders and agronomists need to evaluate and screen a large population of natural or agronomic variations (cultivars) to select the best ones and to improve the population means. Flower yield and essential oil are correlated with several traits, including number of flowers per plant, fresh flower weight, and plant height (Letchamo 1993). These correlations could be exploited in a sampling procedure.

4.2.1 Plant material

Seeds of a heterogeneous population of chamomile (Matricaria chamomilla L.) collected from natural rangelands of Iran were planted in plastic pots with a top diameter of twenty centimeters and a depth of twenty-five centimeters filled with clay loam soil. Pots were arranged in a randomized complete block design with thirty pots in three replications under natural conditions in the field, located at Isfahan University of Technology, Isfahan, Iran. Five seeds were planted in each pot, and plants were irrigated every two to four days. After flowering, plant height, fresh and dry weight of flowers, peduncle length, number of flowers, head diameter, number of branches, and essential oil content (based on the percent of flower dry weight) were measured on each flower pot. In the next subsection, we use the information of this sample as a population in a simulation to evaluate the designs.

4.2.2 Design and results

For the chamomilla data, we executed scenarios one and two. We considered the population mean of “Essence” as the main parameter, with $\mu = 0.74$. Skewness of the data was 2.21. Because we had no information about this quantity before sampling, and it is expensive to measure, we used three auxiliary variables, easy to measure and with reasonable Pearson correlation with the main variable, namely “Flower fresh weight” with correlation 0.33, “Stem height” with correlation 0.53, and “Number of Petals” with correlation 0.71. For Neyman allocation, we used the variance of the auxiliary variable. We set $m = 3, 5, 7, 9$. Results are in figures 4 and 5.

Figure 4. Efficiency of NVSR on Real Data for Different m, t and Correlations.

Figure 5. CoEfficiency of VSR on Real Data for Different t and K with m = 5.

In figure 4, we see the results of efficiency for Neyman allocation. In all the cases, we set $K_{VSR} = K_{LTR}$ and $n_{VSR} = n_{LTR}$. In all the cases, NVSR is better than LTR. With increasing $m$, efficiency increases. In most of the cases, higher correlation is associated with higher efficiency, because it leads to an allocation closer to Neyman allocation.

In figure 5, we see scenario two. Here again, the flexibility of VSR allows for fewer sets than LTR. We set $n_{VSR} = n_{LTR}$, $m = 5$, and $t = 1, 2, 3, 4$. The costs were based on $\mathrm{Cost}_{LTR}$ and $\mathrm{Cost}_{VSR}$, and the auxiliary variable “Number of Petals” was used. Also, we let the relative cost $(C_y + C_e)/C_x$ vary from 0.1 to 5. In all the cases, CoEfficiency increases with increasing cost of sampling relative to measurement. CoEfficiency is larger for smaller $K_{VSR}$.

Figure 6 is presented to investigate the case of visual ranking based on $\mathrm{Cost}^{*}_{LTR}$ and $\mathrm{Cost}^{*}_{VSR}$. For ranking, the auxiliary variable with 0.71 correlation was used, and we set $n_{VSR} = n_{LTR}$, $C_r = C_x/20$, $t = 2$, and $m = 3, 4, 5, 6$. CoEfficiency is larger for smaller $K_{VSR}$, and it increases with increasing cost of sampling and ranking relative to measurement. Choosing smaller $m$ reduces the efficiency of the stratification but reduces ranking costs, while larger $m$ increases the efficiency of the stratification but with increased ranking costs. As we can see in figure 6, CoEfficiency generally increases, showing that VSR is more successful than LTR in trading off between costs and efficiency.

Figure 6. CoEfficiency of VSR on Real Data for Different m and K with t = 2 Based on $\mathrm{Cost}^{*}_{LTR}$ and $\mathrm{Cost}^{*}_{VSR}$.

5. CONCLUSION

Our proposed new strategy is an easy and inexpensive version of RSS, allowing good control of the sample size within each stratum. The design is easy to implement and analytically tractable.
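Controlling the sample size enables us to use Neyman allocation, leading to significant efficiency with a reasonable auxiliary variable. VSR just requires choosing a number of sets, $K$, larger than $n$, and the strategy is better than SRS. In brief, the design is efficient, easy to perform, easy to calculate, and allows good control of costs.

Appendix A. Proof of Theorem 1

First we prove two important identities:

$$\mu = E(X_1) = E\big(E(X_1 \mid \mathrm{rank}(X_1))\big) = \frac{1}{m}\sum_{h=1}^{m} E(X_1 \mid \mathrm{rank}(X_1) = h) = \frac{1}{m}\sum_{h=1}^{m}\mu_h \qquad (6)$$

$$\sigma^2 = \mathrm{Var}\big(E(X_1 \mid \mathrm{rank}(X_1))\big) + E\big(\mathrm{Var}(X_1 \mid \mathrm{rank}(X_1))\big) \qquad (7)$$
$$= \mathrm{Var}\left[\sum_{h=1}^{m}\mu_h I(\mathrm{rank}(X_1) = h)\right] + E\left[\sum_{h=1}^{m}\sigma_h^2 I(\mathrm{rank}(X_1) = h)\right]$$
$$= \frac{1}{m}\sum_{h=1}^{m}(\mu_h - \mu)^2 + \frac{1}{m}\sum_{h=1}^{m}\sigma_h^2$$

where $\mathrm{rank}(X_1)$ indicates the rank of $X_1$ in its selected set and $I(\mathrm{rank}(X_1) = h)$ is an indicator function which takes the value 1 if $\mathrm{rank}(X_1) = h$. Now with

$$\hat{\mu}_{VSR} = \frac{1}{m}\sum_{h=1}^{m}\bar{X}_{[h]} = \frac{1}{mK}\sum_{h=1}^{m}\sum_{i \in s_h}\frac{X_{[h]i}}{n_h/K} = \frac{1}{mK}\sum_{h=1}^{m}\sum_{i=1}^{K}\frac{X_{[h]i}\,I_{hi}}{n_h/K}$$

where

$$I_{hi} = \begin{cases} 1 & \text{if } i \in s_h, \\ 0 & \text{otherwise,} \end{cases}$$

we have

$$E(\hat{\mu}_{VSR}) = E\big[E(\hat{\mu}_{VSR} \mid X_{Km}, Y_{Km})\big] = E\left[\frac{1}{mK}\sum_{h=1}^{m}\sum_{i=1}^{K}\frac{X_{[h]i}\,E(I_{hi} \mid X_{Km}, Y_{Km})}{n_h/K}\right]$$
$$= E\left(\frac{1}{mK}\sum_{h=1}^{m}\sum_{i=1}^{K} X_{[h]i}\,\frac{n_h/K}{n_h/K}\right) = \frac{1}{mK}\sum_{h=1}^{m}\sum_{i=1}^{K} E(X_{[h]i}) = \mu$$

where $X_{Km}, Y_{Km}$ denote the values for the entire finite population of size $Km$. For the variance we have

$$\mathrm{Var}(\hat{\mu}_{VSR}) = V\big[E(\hat{\mu}_{VSR} \mid X_{Km}, Y_{Km})\big] + E\big[V(\hat{\mu}_{VSR} \mid X_{Km}, Y_{Km})\big].$$

The first term is

$$V\big[E(\hat{\mu}_{VSR} \mid X_{Km}, Y_{Km})\big] = V\big[\bar{X}_{Km}\big] = \frac{\sigma^2}{Km}.$$

The second term is

$$E\big[V(\hat{\mu}_{VSR} \mid X_{Km}, Y_{Km})\big] = E\left[\frac{1}{(mK)^2}\sum_{h=1}^{m} K^2\,\frac{1 - n_h/K}{n_h}\,S_{hK}^2\right] = \frac{1}{m^2}\sum_{h=1}^{m}\frac{1 - n_h/K}{n_h}\,\sigma_h^2$$

where

$$S_{hK}^2 = \frac{1}{K-1}\sum_{i=1}^{K}\left(X_{[h]i} - \bar{X}_{[h]K}\right)^2, \qquad \bar{X}_{[h]K} = \frac{1}{K}\sum_{i=1}^{K} X_{[h]i}$$

and then

$$\mathrm{Var}(\hat{\mu}_{VSR}) = \frac{\sigma^2}{Km} + \frac{1}{m^2}\sum_{h=1}^{m}\frac{1 - n_h/K}{n_h}\,\sigma_h^2.$$

For proving the unbiasedness of $\widehat{\mathrm{Var}}(\hat{\mu}_{VSR})$, we use some ideas of MacEachern, Ozturk, Wolfe, and Stark (2002). First note that in the design, from (6) and (7), we have

$$\mathrm{Var}(\hat{\mu}_{VSR}) = \frac{1}{m^2}\sum_{h=1}^{m}\frac{\sigma_h^2}{n_h} + \frac{1}{m^2 K}\sum_{h=1}^{m}(\mu_h - \mu)^2$$

and

$$E\left(\sum_{i \in s_h}\left(X_{[h]i} - \bar{X}_{[h]}\right)^2\right) = \sum_{i=1}^{K} E\left(X_{[h]i}^2 I_{hi}\right) - n_h E\left(\bar{X}_{[h]}^2\right)$$
$$= \sum_{i=1}^{K}\frac{n_h}{K}\left(\mathrm{Var}(X_{[h]i}) + E^2(X_{[h]i})\right) - n_h\left(\mathrm{Var}(\bar{X}_{[h]}) + E^2(\bar{X}_{[h]})\right)$$
$$= n_h\left(\sigma_h^2 + \mu_h^2\right) - n_h\left(\frac{1 - n_h/K}{n_h}\,\sigma_h^2 + \frac{1}{K}\,\sigma_h^2 + \mu_h^2\right) = (n_h - 1)\,\sigma_h^2$$

and also

$$E\left(\sum_{h=1}^{m}\left(\bar{X}_{[h]} - \hat{\mu}_{VSR}\right)^2\right) = \sum_{h=1}^{m} E\left(\bar{X}_{[h]}^2\right) - m\,E\left(\hat{\mu}_{VSR}^2\right)$$
$$= \sum_{h=1}^{m}\left(\frac{\sigma_h^2}{n_h} + \mu_h^2\right) - \frac{1}{m}\sum_{h=1}^{m}\frac{\sigma_h^2}{n_h} - \frac{1}{mK}\sum_{h=1}^{m}(\mu_h - \mu)^2 - m\mu^2$$
$$= \frac{m-1}{m}\sum_{h=1}^{m}\frac{\sigma_h^2}{n_h} + \frac{mK-1}{mK}\sum_{h=1}^{m}(\mu_h - \mu)^2$$

and then

$$E\big(\widehat{\mathrm{Var}}(\hat{\mu}_{VSR})\big) = \mathrm{Var}(\hat{\mu}_{VSR}).$$

Appendix B.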
Proof of Corollary 2

For the first part of Corollary 2, by putting $n$ instead of $n_h$ in (1) and using (6) and (7), it is easy to prove (2). For the second part, according to (3) we have (see page 264, David and Nagaraja 2003)

$$\mu_h - \mu = \rho_{xy}\,(\mu_{(h)} - \mu) \qquad (8)$$

and then it is enough to substitute (8) into (2).

Appendix C. Proof of Theorem 3

For proving Theorem 3, substituting (4) into (1) we have

$$\mathrm{Var}(\hat{\mu}_{NVSR}) = \frac{1}{m^2}\left(\frac{\left(\sum_{h=1}^{m}\sigma_h\right)^2}{n_{\bullet}} - \frac{\sum_{h=1}^{m}\sigma_h^2}{K}\right) + \frac{\sigma^2}{Km}$$

and then, substituting $n_h = n_{\bullet}/m$ into (1) for VSR with equal allocation, we have

$$\mathrm{Var}(\hat{\mu}_{VSR}) - \mathrm{Var}(\hat{\mu}_{NVSR}) = \frac{1}{m\,n_{\bullet}}\left(\sum_{h=1}^{m}\sigma_h^2 - \frac{\left(\sum_{h=1}^{m}\sigma_h\right)^2}{m}\right) = \frac{1}{m\,n_{\bullet}}\sum_{h=1}^{m}\left(\sigma_h - \frac{\sum_{h=1}^{m}\sigma_h}{m}\right)^2 \ge 0.$$

References

Chen Z., Bai Z., Sinha B. (2004), Ranked Set Sampling: Theory and Applications, Lecture Notes in Statistics, New York: Springer.
David H. A., Nagaraja H. N. (2003), Order Statistics (3rd ed.), New York: Wiley.
Gardiner P. (1999), Chamomile (Matricaria recutita, Anthemis nobilis), Longwood Herbal Task Force Press, pp. 1–21.
Ghosh K., Tiwari R. C. (2008), “Estimating the Distribution Function Using L-Tuple Ranked Set Samples,” Journal of Statistical Planning and Inference, 138, 929–949.
Ghosh K., Tiwari R. C. (2009), “A Unified Approach to Variations of Ranked Set Sampling with Applications,” Journal of Nonparametric Statistics, 21, 471–485.
Kaur A., Patil G. P., Taillie C. A. (1997), “Unequal Allocation Models for Ranked Set Sampling with Skew Distributions,” Biometrics, 53, 123–130.
Letchamo W. (1993), Nitrogen Application Affects Yield and Content of the Active Substances in Chamomile Genotypes, New Crops, New York: Wiley, pp. 636–639.
MacEachern S. N., Ozturk O., Wolfe D. A., Stark G. V. (2002), “A New Ranked Set Sample Estimator of Variance,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64, 177–188.
McIntyre G. A.
(1952), “A Method of Unbiased Selective Sampling Using Ranked Sets,” Australian Journal of Agricultural Research, 3, 385–390.
Patil G. P., Sinha A. K., Taillie C. (1999), “Ranked Set Sampling: A Bibliography,” Environmental and Ecological Statistics, 6, 91–98.
Ridout M. S. (2003), “On Ranked Set Sampling for Multiple Characteristics,” Environmental and Ecological Statistics, 10, 225–262.
Sarndal C. E., Swensson B., Wretman J. (1992), Model Assisted Survey Sampling, New York: Springer.
Samawi H. M. (1996), “Stratified Ranked Set Sample,” Pakistan Journal of Statistics, 12, 9–16.
Wang Y.-G., Chen Z. H., Liu J. (2004), “General Ranked Set Sampling with Cost Considerations,” Biometrics, 60, 556–561.

© The Author 2017. Published by Oxford University Press on behalf of the American Association for Public Opinion Research. All rights reserved. For permissions, please email: journals.permissions@oup.com. This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices).

Journal of Survey Statistics and Methodology, Oxford University Press

Publisher: Oxford University Press
ISSN: 2325-0984
eISSN: 2325-0992
DOI: 10.1093/jssam/smx026
Publisher site
See Article on Publisher Site

Abstract

Abstract In ranked-set sampling, the restriction of selecting just one individual from each set may require too many sets. We propose a new version of ranked-set sampling that relaxes this restriction. Our new design uses stratified sampling in which ranked-set sampling is used to form the strata. Simulations, and a real case study on medicinal flowers, show that this design can be more precise and less costly than previous designs. 1. INTRODUCTION Ranked-set sampling (RSS) was first introduced by McIntyre (1952) and has been a widely used design in many applications. It is particularly appealing in agricultural and environmental sciences, where identifying sampling units in the field is straightforward, but exact measurement of the units is time consuming. The RSS technique takes several random samples of size m from the population. The sample units are ranked by some quick and easy measure. Then, one unit from each sample is chosen and precisely measured for the character of interest. Consider m samples of size m. The unit that has the lowest rank among the m units in the first sample is chosen, the unit with the second lowest rank is chosen from the second sample, and so on until the unit with the highest rank is chosen from the mth sample. This process is repeated n times, giving a final sample size, nm. This balanced RSS method can be generalized to unbalanced designs, where the number of sample units selected at each of the ranks is not constant. With highly skewed population distributions, more units from low (or high) ranks can be selected. Unbalanced designs are similar in concept to optimal allocation in stratified sampling where strata with larger variances take larger sample fractions. RSS is reported as being more efficient than simple random sampling (SRS) (Samawi 1996; Ridout 2003). See full reviews of RSS by Patil, Sinha, and Taillie (1999) and the related book of Chen, Bai, and Sinha (2004). 
In RSS designs, just one sample unit from each ranked-set is selected. Ranking can be based on some auxiliary, easy-to-measure variables, but the cost involved in sampling and ranking cannot be completely ignored. Wang, Chen, and Liu (2004) proposed a design to obtain more than one unit from each ranked-set and showed that the design can be more efficient than RSS. We call their design “L-Tuple RSS.” Ghosh and Tiwari (2008, 2009) considered L-Tuple RSS for estimating distribution functions and presented a general form of design that unifies different versions of RSS. For unbalanced RSS, McIntyre (1952) proposed that the sample size corresponding to each rank order should be proportional to its standard deviation; i.e., an unequal allocation, as realized in the Neyman approach. In this context, Kaur, Patil, and Taillie (1997) note that the variance of order statistics typically increases with the rank orders for positively skewed distributions on (0,+∞). This work is consistent with Neyman’s optimal allocation and proposes two models for unequal allocation for skew distributions. As in Wang’s approach, we introduce a design that relaxes the restriction to one sample unit from each set and the restriction with respect to the number of needed sets. Relaxing these two restrictions leads to fewer sets and a more flexible design. We introduce the new method, namely Virtual Stratified Sampling using RSS in section 2. In section 3, we define two criteria based on cost and precision to evaluate the design. Section 4 contains simulations and a real case study to compare the methods. Section 5 presents conclusions. 2. MORE THAN ONE SAMPLE UNIT FROM EACH SET 2.1 L-Tuple RSS (LTR) Wang et al. (2004) suggested forming Lm,t=(mt) ranked-sets of size m, and then selecting a sample of t units from each set, identified by mutually different ranks. For example, if t = 2, select the units with ranks one and two from the first set and units with ranks one and three from the second set. 
This process is repeated until each pair of ranks among Lm,2 possible pairs is selected. This design needs fewer ranked-set units than RSS, and Wang et al. (2004) showed that the design is more cost-efficient than RSS. However, the inflexibility of the design in attaining a desired sample size can be a weakness. Given the main parameters of the design (m and t) and the number of cycles (d), implementation requires that we obtain d×Lm,t sets with corresponding sample size given by d×t×Lm,t, which may impose large gaps between the possible sample sizes. For example, with m = 6 and t = 3, Lm,t=20, we can implement the design with 20, 40, 60, 80, etc., sets and sample sizes of 60, 120, 180, 240, etc., corresponding to d=1,2,3,4,…. 2.2 Virtual Stratified Sampling Using Ranked Set Sampling (VSR) We introduce a new strategy in which two, three, or all elements of a set can be selected in the final sample without any restriction to have a balanced form. Suppose that X is the variable of interest with E(X)=μ and V(X)=σ2<∞. The goal is to estimate μ and the variance of the estimator. Suppose further that Y is an auxiliary variable (used for ranking) with E(Y)=μy and V(Y)=σy2<∞. And suppose that Y has reasonable correlation with the main variable X. When it is possible to rank on X directly, we simply take Y = X in this formulation. For i=1,2,…,K, we select the ith set of m units from the population and assume that the corresponding (X, Y) values are independent and identically distributed from their bivariate distribution. We order the Y values for the m sampling units in the ith set as Y[1]i≤Y[2]i≤⋯≤Y[m]i and denote the corresponding X values ordered on Y as X[1]i,X[2]i,⋯,X[m]i. By conditioning on these X and Y values, we have a finite population with m strata, each of size K, as depicted in table 1. 
From the K units in stratum h of this stratified population, we draw a sample $s_h$ of size $n_h$ via simple random sampling without replacement (SRSWOR), where $1\le n_h\le K$ and $\sum_{h=1}^{m}n_h=n_\bullet$. We obtain the measurements of interest $X_{[h]i}$ for the units in $s_h$.

Table 1. Virtual Stratified Population Based on the Auxiliary Variable

| | 1st stratum | 2nd stratum | ⋯ | mth stratum |
|---|---|---|---|---|
| 1st set | $X_{[1]1}$ | $X_{[2]1}$ | ⋯ | $X_{[m]1}$ |
| 2nd set | $X_{[1]2}$ | $X_{[2]2}$ | ⋯ | $X_{[m]2}$ |
| ⋮ | ⋮ | ⋮ | ⋱ | ⋮ |
| Kth set | $X_{[1]K}$ | $X_{[2]K}$ | ⋯ | $X_{[m]K}$ |

The proposed estimator is then

$$\hat\mu_{VSR}=\frac{1}{m}\sum_{h=1}^{m}\bar X_{[h]},\qquad \bar X_{[h]}=\frac{1}{n_h}\sum_{i\in s_h}X_{[h]i}.$$

The link between virtual stratified sampling (VSR) and finite population sampling makes analytic evaluation of VSR and its variants straightforward, as shown below. This analytic tractability is an advantage of VSR over rival designs. For a sample of size m, define $\mu_h=E(X\mid \mathrm{rank}(Y)=h)$ and $\sigma_h^2=V(X\mid \mathrm{rank}(Y)=h)$. The following theorem provides a simple estimator of the variance of $\hat\mu_{VSR}$.

Theorem 1. In VSR, $\hat\mu_{VSR}$ is an unbiased estimator of $\mu$ with

$$\mathrm{Var}(\hat\mu_{VSR})=\frac{\sigma^2}{Km}+\frac{1}{m^2}\sum_{h=1}^{m}\frac{1-\frac{n_h}{K}}{n_h}\,\sigma_h^2, \tag{1}$$

and an unbiased estimator of $\mathrm{Var}(\hat\mu_{VSR})$ is

$$\widehat{\mathrm{Var}}(\hat\mu_{VSR})=\frac{K-1}{m(mK-1)}\sum_{h=1}^{m}\frac{1}{n_h(n_h-1)}\sum_{i\in s_h}\left(X_{[h]i}-\bar X_{[h]}\right)^2+\frac{1}{m(mK-1)}\sum_{h=1}^{m}\left(\bar X_{[h]}-\hat\mu_{VSR}\right)^2.$$

For the proof, see appendix A.

If the distribution of the population is skewed, we can allocate $n_h$ to oversample strata with large variances and gain additional reduction in the estimator variance; see §2.3 below. Alternatively, we can sample with equal allocation across strata.

Corollary 2. Under VSR with equal allocation, i.e., $n_h=n$ for $h=1,2,\ldots,m$, we have

$$\mathrm{Var}(\hat\mu_{VSR})=\frac{1}{nm}\left[\sigma^2-\left(1-\frac{n}{K}\right)\frac{1}{m}\sum_{h=1}^{m}(\mu_h-\mu)^2\right].$$
(2)

If we assume X and Y are sampled from the linear regression model

$$X=\mu+\rho_{xy}\frac{\sigma}{\sigma_y}(Y-\mu_y)+\varepsilon, \tag{3}$$

where $\rho_{xy}$ is the Pearson correlation and $\varepsilon$ is a random variable independent of Y, then

$$\mathrm{Var}(\hat\mu_{VSR})=\frac{1}{nm}\left[\sigma^2-\rho_{xy}^2\left(1-\frac{n}{K}\right)\frac{1}{m}\sum_{h=1}^{m}(\mu_{(h)}-\mu)^2\right].$$

For the proof, see appendix B.

It is possible to rewrite (2) as

$$\mathrm{Var}(\hat\mu_{VSR})=\underbrace{\frac{\sigma^2}{nm}}_{A}-\underbrace{\frac{1}{nm^2}\sum_{h=1}^{m}(\mu_h-\mu)^2}_{B}+\underbrace{\frac{1}{Km^2}\sum_{h=1}^{m}(\mu_h-\mu)^2}_{C},$$

and then we have

- A: the variance of the sample mean in SRS ($\mathrm{Var}_{SRS}$)
- A − B: the variance of the sample mean in RSS ($\mathrm{Var}_{RSS}$)
- A − B + C: the variance of the sample mean in VSR ($\mathrm{Var}_{VSR}$)

Since A, B, and C are nonnegative, $C\le B$ (because $n\le K$), and $\lim_{K\to\infty}C=0$, it is easy to conclude the following:

- $\mathrm{Var}_{RSS}\le \mathrm{Var}_{SRS}$ and $\mathrm{Var}_{VSR}\le \mathrm{Var}_{SRS}$.
- $\mathrm{Var}_{RSS}\le \mathrm{Var}_{VSR}$. However, the simulations show that, once cost is taken into account, VSR can be more efficient than RSS.
- $\lim_{K\to\infty}\mathrm{Var}_{VSR}=\mathrm{Var}_{RSS}$.

Taking more than one sample from each set is notably useful when the cost of sampling or ranking is high.

2.3 VSR with Neyman Allocation (NVSR)

To enhance efficiency, we can use Neyman allocation, provided information about the variances of the strata is available. Different measurement costs in different strata can also modify the optimal allocation (Sarndal, Swensson, and Wretman 1992, p. 104). For example, it might be more costly to get precise measurements of the sampled elements in the stratum of the largest order statistic than in the stratum of the smallest order statistic. Here, we consider Neyman allocation when the cost is equal in all strata. NVSR is a special case of VSR with

$$n_h=n_\bullet\,\frac{\sigma_h}{\sum_{h=1}^{m}\sigma_h}. \tag{4}$$

We denote the NVSR mean estimator by $\hat\mu_{NVSR}$. Theorem 3 shows that there is a gain from combining VSR with the Neyman allocation procedure.

Theorem 3. In VSR with equal allocation strategy, we have the following:

$$\frac{\mathrm{Var}(\hat\mu_{VSR})}{\mathrm{Var}(\hat\mu_{NVSR})}\ge 1.$$

For the proof, see appendix C. Henceforth, VSR stands for VSR with equal allocation.

3. COMPARING VSR WITH LTR

3.1 Variance of the Designs

Based on Wang et al.
(2004), for the unbiased estimator of $\mu$ in LTR with d cycles, we have

$$\mathrm{Var}(\hat\mu_{LTR})=\frac{1}{d\,t\,L_{m,t}}\left[\sigma^2-\frac{m-t}{m-1}\,\frac{1}{m}\sum_{h=1}^{m}(\mu_h-\mu)^2\right].$$

If we define the efficiency of VSR relative to LTR as

$$\mathrm{efficiency}(\hat\mu_{VSR},\hat\mu_{LTR})=\frac{\mathrm{Var}(\hat\mu_{LTR})}{\mathrm{Var}(\hat\mu_{VSR})},$$

we have

$$\mathrm{efficiency}(\hat\mu_{VSR},\hat\mu_{LTR})=\frac{\dfrac{1}{d\,t\,L_{m,t}}\left[\sigma^2-\dfrac{m-t}{m-1}\,\dfrac{1}{m}\sum_{h=1}^{m}(\mu_h-\mu)^2\right]}{\dfrac{1}{nm}\left[\sigma^2-\left(1-\dfrac{n}{K}\right)\dfrac{1}{m}\sum_{h=1}^{m}(\mu_h-\mu)^2\right]}.$$

Let $n_{VSR}$ and $n_{LTR}$ denote the total sample sizes measured in VSR and LTR, respectively, and let $K_{VSR}$ and $K_{LTR}$ denote the total numbers of sets used in VSR and LTR, respectively. Then

$$n_{VSR}=nm,\qquad n_{LTR}=d\,t\,L_{m,t},\qquad K_{VSR}=K,\qquad K_{LTR}=d\,L_{m,t}.$$

If we set $n_{VSR}=n_{LTR}$ and $K_{VSR}=K_{LTR}$, it is easy to show that $\mathrm{efficiency}(\hat\mu_{VSR},\hat\mu_{LTR})\le 1$: these two conditions give $n/K=t/m$, so the VSR factor is $1-n/K=(m-t)/m$, which is no larger than the LTR factor $(m-t)/(m-1)$. In other words, VSR cannot be more efficient than LTR with the same sample size and the same number of sets. This happens because LTR distributes the final sample units over the sets so that the covariance between selected units is smallest. This is one of the advantages of LTR relative to VSR. As a disadvantage, LTR is not flexible: depending on the parameters, either $n_{LTR}$ or $K_{LTR}$ is imposed by the design. VSR, on the other hand, is very flexible in setting $n_{VSR}$ and $K_{VSR}$. We believe VSR can be more efficient than LTR under two scenarios:

1. Using its flexibility, we implement NVSR based on auxiliary-variable information about the strata.
2. Based on cost considerations, with $n_{VSR}=n_{LTR}$, we set $K_{VSR}$ less than $K_{LTR}$.

Scenario one uses the stratified aspect of VSR. After the first-stage sample (of size mK), it is easy for a researcher to implement stratified sampling with Neyman allocation. A potential disadvantage of this scenario, if it is based on the variability of the auxiliary variable, is that the allocation cannot be completed, the final sample cannot be determined, and the final measurements cannot be obtained until all initial rankings and auxiliary measurements are completed. This means retaining all of the sets until the final measurements are completed, which might be impractical in some applications (e.g., with live animals or time-sensitive materials).
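The allocation step of scenario one can be sketched as follows. Allocation (4) is fractional, so some rounding rule is needed in practice; the rule used here (floor, clamp to $1\le n_h\le K$, then adjust one unit at a time toward the ideal shares) is an assumption for illustration, as are the function name `neyman_allocation` and the example standard deviations. The paper does not specify how rounding was done.

```python
def neyman_allocation(stratum_sds, n_total, K):
    """Integer allocation n_h roughly proportional to stratum SDs, with 1 <= n_h <= K.

    stratum_sds: standard deviation per virtual stratum (e.g., estimated
                 from the auxiliary variable after ranking).
    n_total:     total number of units to measure (n_bullet in the text).
    K:           number of sets, i.e., the size of each virtual stratum.
    """
    m = len(stratum_sds)
    if not m <= n_total <= m * K:
        raise ValueError("n_total must lie between m and m*K")
    total_sd = sum(stratum_sds)
    if total_sd == 0:
        ideal = [n_total / m] * m  # no variance information: equal shares
    else:
        ideal = [n_total * s / total_sd for s in stratum_sds]
    # Assumed rounding rule: floor, then clamp into [1, K].
    alloc = [min(K, max(1, int(x))) for x in ideal]
    # Add units where the shortfall against the ideal share is largest.
    while sum(alloc) < n_total:
        h = max((j for j in range(m) if alloc[j] < K),
                key=lambda j: ideal[j] - alloc[j])
        alloc[h] += 1
    # Remove units where the excess over the ideal share is largest.
    while sum(alloc) > n_total:
        h = max((j for j in range(m) if alloc[j] > 1),
                key=lambda j: alloc[j] - ideal[j])
        alloc[h] -= 1
    return alloc

# Illustrative SDs that increase with rank, as for a positively skewed population.
example = neyman_allocation([1.0, 2.0, 3.0, 4.0], n_total=10, K=5)
```

With these illustrative inputs the ideal shares are already integers, so `example` is `[1, 2, 3, 4]`. When one stratum dominates, the cap $n_h\le K$ binds and the remainder is spread over the other strata.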
Scenario two is the most important advantage of VSR. It lets us exploit the efficiency of RSS without its strict sampling restrictions.

3.2 Cost Considerations

We define the costs of the designs as

$$\mathrm{Cost}_{LTR}=d\,m\,L_{m,t}\,C_e+d\,m\,L_{m,t}\,C_y+d\,t\,L_{m,t}\,C_x \quad\text{for LTR},$$

$$\mathrm{Cost}_{VSR}=mK\,C_e+mK\,C_y+mn\,C_x \quad\text{for VSR},$$

where

- $C_e$: cost of sampling one unit;
- $C_x$, $C_y$: cost of measuring the main variable and the auxiliary variable, respectively, on one unit.

Now we can define efficiency based on cost as

$$\mathrm{CoEfficiency}(\hat\mu_{VSR},\hat\mu_{LTR})=\frac{\mathrm{Var}(\hat\mu_{LTR})}{\mathrm{Var}(\hat\mu_{VSR})}\times\frac{\mathrm{Cost}_{LTR}}{\mathrm{Cost}_{VSR}}. \tag{5}$$

Criterion (5) accounts for precision and cost simultaneously. Note that if we rank visually, the part of the cost corresponding to measurement of the auxiliary variable should be replaced by the cost of ranking. Since the expected number of pairwise comparisons needed for ranking m units is $(m+2)(m-1)/4$ (Wang et al. 2004), we propose the following two alternative costs:

$$\mathrm{Cost}^{*}_{LTR}=d\,m\,L_{m,t}\,C_e+d\,\frac{(m+2)(m-1)}{4}\,L_{m,t}\,C_r+d\,t\,L_{m,t}\,C_x \quad\text{for LTR},$$

$$\mathrm{Cost}^{*}_{VSR}=mK\,C_e+\frac{(m+2)(m-1)}{4}\,K\,C_r+mn\,C_x \quad\text{for VSR},$$

where $C_r$ is the cost of each pairwise comparison.

4. SIMULATION STUDIES

In this section, we compare VSR and LTR using simulated populations with two scenarios and four distributions, and a real case study on medicinal flowers. In the simulations, at least one sample is selected from each stratum, regardless of its size or variance. All simulations were run in R 3.1.2. We used a Monte Carlo approach and simulated 20,000 samples for each design option.

4.1 Comparing VSR with LTR Using Some Simulations on Four Distributions

We considered four distributions: normal, lognormal, chi-square, and exponential. We performed the simulations with different parameters. For all of them, we set m = 5, $K_{VSR}=K_{LTR}$, and $n_{VSR}=n_{LTR}$ so that the costs of VSR and LTR were the same.
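The cost model of section 3.2 and the LTR size constraint of section 2.1 can be checked numerically with a short sketch. The function names and the unit costs ($C_e=1$, $C_y=1$, $C_x=5$) are illustrative assumptions; only the formulas come from the text.

```python
from math import comb

def ltr_sizes(m, t, d):
    """Number of sets and measured units imposed by LTR with d cycles."""
    L = comb(m, t)  # L_{m,t} = C(m, t)
    return {"sets": d * L, "sample": d * t * L}

def cost_ltr(m, t, d, Ce, Cy, Cx):
    """Cost_LTR = d*m*L*Ce + d*m*L*Cy + d*t*L*Cx."""
    L = comb(m, t)
    return d * m * L * Ce + d * m * L * Cy + d * t * L * Cx

def cost_vsr(m, K, n, Ce, Cy, Cx):
    """Cost_VSR = m*K*Ce + m*K*Cy + m*n*Cx (equal allocation, n per stratum)."""
    return m * K * Ce + m * K * Cy + m * n * Cx

def coefficiency(var_ltr, var_vsr, c_ltr, c_vsr):
    """Criterion (5): variance ratio times cost ratio."""
    return (var_ltr / var_vsr) * (c_ltr / c_vsr)

# Section 2.1 example: m = 6, t = 3 gives L = 20 sets per cycle.
sizes_d2 = ltr_sizes(m=6, t=3, d=2)

# Illustrative unit costs (assumed): Ce = 1, Cy = 1, Cx = 5.
Ce, Cy, Cx = 1, 1, 5
m, t, d = 5, 2, 1
ltr = ltr_sizes(m, t, d)                      # 10 sets, 20 measured units
c_ltr = cost_ltr(m, t, d, Ce, Cy, Cx)

# Same number of sets and same sample size: K = 10, n = 4 per stratum (m*n = 20).
c_vsr_same = cost_vsr(m, K=10, n=4, Ce=Ce, Cy=Cy, Cx=Cx)

# Scenario two: same sample size but only K = 5 sets.
c_vsr_fewer = cost_vsr(m, K=5, n=4, Ce=Ce, Cy=Cy, Cx=Cx)
```

With matching K and n the two costs coincide, as stated above for the section 4.1 settings; halving the number of sets lowers the VSR cost while the sample size stays at 20, which is the lever scenario two relies on.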
For sorting observations in the sets to execute RSS, and for the within-stratum variance information needed in Neyman allocation, we used an auxiliary variable for each distribution, with approximately 0.70 correlation with the survey variable. Results are shown in figure 1. In the case of positively skewed distributions (lognormal, chi-square, and exponential), NVSR is better than LTR. As the distribution approaches a symmetric distribution or a distribution with light tails (as with chi-square and normal), the efficiency of NVSR decreases. Also, in almost all the cases, the efficiency of NVSR relative to LTR decreases as t decreases. Consideration of efficiency then suggests the use of NVSR rather than LTR for asymmetric distributions. We also note the increase in efficiency in the case of the lognormal, resulting from its heavy right tail.

Figure 1. Efficiency of Neyman Allocation for the Four Distributions with Parameter v, Which Varies along the Horizontal Axis, for Different t with m = 5.

For scenario two, for exp(2) and lognormal(0,2) we set $n_{VSR}=n_{LTR}$ and m = 5, with t = 1 (figure 2) and t = 2 (figure 3). The costs were based on $\mathrm{Cost}_{LTR}$ and $\mathrm{Cost}_{VSR}$. Given the three parameters $n_{LTR}$, t, and m, the other parameters d and $K_{LTR}$ were imposed in LTR. However, VSR is flexible enough to allow $K_{VSR}$ to be chosen freely. Also, we let the parameter $(C_y+C_e)/C_x$ vary from 0.1 to 20. In all the cases, we set $C_x=5C_y$. In figures 2 and 3, the left plots show VSR (scenario two), and the right plots show NVSR (scenarios one and two together). As the figures show, CoEfficiency (5) increases as the cost of sampling ($C_y+C_e$) increases relative to the cost of measurement ($C_x$). Also, smaller $K_{VSR}$ is more CoEfficient (i.e., CoEfficiency ≥ 1) than larger $K_{VSR}$.
CoEfficiency is larger for t = 1 than for t = 2, because in LTR the cost of the former (which needs more sets) is more than that of the latter. The combination of the two scenarios (right plots) shows surprisingly high CoEfficiency, rising as high as 10 in some cases for the lognormal.

Figure 2. CoEfficiency of NVSR and VSR for Exp(2) and Lognormal(0,2) with m = 5 and t = 1.

Figure 3. CoEfficiency of NVSR and VSR for Exp(2) and Lognormal(0,2) with m = 5 and t = 2.

4.2 Comparing VSR with LTR Using a Real Case Study on Some Medicinal Flowers

Matricaria chamomilla, commonly known as chamomile, is an annual plant of the Asteraceae family native to Iran and European countries. Chamomile is considered a very important commercial and medicinal plant. The plant's flowers, usually dried, are used in herbal medicine for a sore stomach, for irritable bowel syndrome, and as a gentle sleep aid. Chamomile taken as an herbal tea is also used as a mild laxative and has anti-inflammatory and bactericidal effects. The health benefits of chamomile are due to the flower’s essential oil content, which can be 0.24 to 2.0 percent of flower dry weight (Gardiner 1999). As a result, achieving the maximum amount of dried flower is critical to maximizing essential oil yield. To reach this goal, plant breeders and agronomists need to evaluate and screen a large population of natural or agronomic variations (cultivars) to select the best ones and to improve the population means. Flower yield and essential oil are correlated with several traits, including number of flowers per plant, fresh flower weight, and plant height (Letchamo 1993). These correlations could be exploited in a sampling procedure.
4.2.1 Plant material

Seeds of a heterogeneous population of chamomile (Matricaria chamomilla L.) collected from natural rangelands of Iran were planted in plastic pots with a top diameter of twenty centimeters and a depth of twenty-five centimeters, filled with clay loam soil. Pots were arranged in a randomized complete block design with thirty pots in three replications under natural conditions in the field, located at Isfahan University of Technology, Isfahan, Iran. Five seeds were planted in each pot, and plants were irrigated every two to four days. After flowering, plant height, fresh and dry weight of flowers, peduncle length, number of flowers, head diameter, number of branches, and essential oil content (based on the percent of flower dry weight) were measured on each pot. In the next subsection, we use this sample as a population in a simulation to evaluate the designs.

4.2.2 Design and results

For the chamomile data, we executed scenarios one and two. We considered the population mean of “Essence” as the main parameter, with $\mu=0.74$. Skewness of the data was 2.21. Because we had no information about this quantity before sampling, and it is expensive to measure, we used three auxiliary variables that are easy to measure and have reasonable Pearson correlation with the main variable, namely “Flower fresh weight” with correlation 0.33, “Stem height” with correlation 0.53, and “Number of Petals” with correlation 0.71. For Neyman allocation, we used the variance of the auxiliary variable. We set m = 3, 5, 7, 9. Results are in figures 4 and 5.

Figure 4. Efficiency of NVSR on Real Data for Different m, t and Correlations.

Figure 5. CoEfficiency of VSR on Real Data for Different t and K with m = 5.
Figure 4 shows the efficiency results for Neyman allocation. In all the cases, we set $K_{VSR}=K_{LTR}$ and $n_{VSR}=n_{LTR}$. In all the cases, NVSR is better than LTR. Efficiency increases with increasing m. In most of the cases, higher correlation is associated with higher efficiency, because it leads to an allocation closer to Neyman allocation.

Figure 5 shows scenario two. Here again, the flexibility of VSR allows for fewer sets than LTR. We set $n_{VSR}=n_{LTR}$, m = 5, and t = 1, 2, 3, 4. The costs were based on $\mathrm{Cost}_{LTR}$ and $\mathrm{Cost}_{VSR}$, and “Number of Petals” was used as the auxiliary variable. We also let the relative cost $(C_y+C_e)/C_x$ vary from 0.1 to 5. In all the cases, CoEfficiency increases as the cost of sampling increases relative to measurement. CoEfficiency is larger for smaller $K_{VSR}$.

Figure 6 investigates the case of visual ranking, based on $\mathrm{Cost}^{*}_{LTR}$ and $\mathrm{Cost}^{*}_{VSR}$. For ranking, the auxiliary variable with 0.71 correlation was used, and we set $n_{VSR}=n_{LTR}$, $C_r=C_x/20$, t = 2, and m = 3, 4, 5, 6. CoEfficiency is larger for smaller $K_{VSR}$, and it increases as the cost of sampling and ranking increases relative to measurement. Choosing smaller m reduces the efficiency of the stratification but reduces ranking costs, while larger m increases the efficiency of the stratification but with increased ranking costs. As figure 6 shows, CoEfficiency generally increases, showing that VSR is more successful than LTR in trading off costs against efficiency.

Figure 6. CoEfficiency of VSR on Real Data for Different m and K with t = 2, Based on $\mathrm{Cost}^{*}_{LTR}$ and $\mathrm{Cost}^{*}_{VSR}$.

5. CONCLUSION

Our proposed new strategy is an easy and inexpensive version of RSS, allowing good control of the sample size within each stratum. The design is easy to implement and analytically tractable.
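Controlling the sample size enables us to use Neyman allocation, leading to significant efficiency gains with a reasonable auxiliary variable. VSR just requires choosing a number of sets, K, larger than n, and the strategy is better than SRS. In brief, the design is efficient, easy to perform, easy to calculate, and allows good control of costs.

Appendix A. Proof of Theorem 1

First we prove two important identities:

$$\mu=E(X_1)=E\big(E(X_1\mid \mathrm{rank}(X_1))\big)=\frac{1}{m}\sum_{h=1}^{m}E(X_1\mid \mathrm{rank}(X_1)=h)=\frac{1}{m}\sum_{h=1}^{m}\mu_h \tag{6}$$

$$\begin{aligned}\sigma^2&=\mathrm{Var}\big(E(X_1\mid \mathrm{rank}(X_1))\big)+E\big(\mathrm{Var}(X_1\mid \mathrm{rank}(X_1))\big)\\&=\mathrm{Var}\Big[\sum_{h=1}^{m}\mu_h\,I(\mathrm{rank}(X_1)=h)\Big]+E\Big[\sum_{h=1}^{m}\sigma_h^2\,I(\mathrm{rank}(X_1)=h)\Big]\\&=\frac{1}{m}\sum_{h=1}^{m}(\mu_h-\mu)^2+\frac{1}{m}\sum_{h=1}^{m}\sigma_h^2\end{aligned} \tag{7}$$

where $\mathrm{rank}(X_1)$ indicates the rank of $X_1$ in its selected set and $I(\mathrm{rank}(X_1)=h)$ is an indicator function that takes the value 1 if $\mathrm{rank}(X_1)=h$. Now with

$$\hat\mu_{VSR}=\frac{1}{m}\sum_{h=1}^{m}\bar X_{[h]}=\frac{1}{mK}\sum_{h=1}^{m}\sum_{i\in s_h}\frac{X_{[h]i}}{n_h/K}=\frac{1}{mK}\sum_{h=1}^{m}\sum_{i=1}^{K}\frac{X_{[h]i}\,I_{hi}}{n_h/K},$$

where

$$I_{hi}=\begin{cases}1 & \text{if } i\in s_h,\\ 0 & \text{otherwise},\end{cases}$$

we have

$$\begin{aligned}E(\hat\mu_{VSR})&=E\big[E(\hat\mu_{VSR}\mid \mathbf{X}_{Km},\mathbf{Y}_{Km})\big]=E\Big[\frac{1}{mK}\sum_{h=1}^{m}\sum_{i=1}^{K}\frac{X_{[h]i}\,E(I_{hi}\mid \mathbf{X}_{Km},\mathbf{Y}_{Km})}{n_h/K}\Big]\\&=E\Big(\frac{1}{mK}\sum_{h=1}^{m}\sum_{i=1}^{K}X_{[h]i}\,\frac{n_h/K}{n_h/K}\Big)=\frac{1}{mK}\sum_{h=1}^{m}\sum_{i=1}^{K}E(X_{[h]i})=\mu,\end{aligned}$$

where $\mathbf{X}_{Km},\mathbf{Y}_{Km}$ denote the values for the entire finite population of size Km. For the variance we have

$$\mathrm{Var}(\hat\mu_{VSR})=V\big[E(\hat\mu_{VSR}\mid \mathbf{X}_{Km},\mathbf{Y}_{Km})\big]+E\big[V(\hat\mu_{VSR}\mid \mathbf{X}_{Km},\mathbf{Y}_{Km})\big].$$

The first term is

$$V\big[E(\hat\mu_{VSR}\mid \mathbf{X}_{Km},\mathbf{Y}_{Km})\big]=V\big[\bar X_{Km}\big]=\frac{\sigma^2}{Km}.$$

The second term is

$$E\big[V(\hat\mu_{VSR}\mid \mathbf{X}_{Km},\mathbf{Y}_{Km})\big]=E\Big[\frac{1}{(mK)^2}\sum_{h=1}^{m}K^2\,\frac{1-\frac{n_h}{K}}{n_h}\,S_{hK}^2\Big]=\frac{1}{m^2}\sum_{h=1}^{m}\frac{1-\frac{n_h}{K}}{n_h}\,\sigma_h^2,$$

where

$$S_{hK}^2=\frac{1}{K-1}\sum_{i=1}^{K}\big(X_{[h]i}-\bar X_{[h]K}\big)^2,\qquad \bar X_{[h]K}=\frac{1}{K}\sum_{i=1}^{K}X_{[h]i},$$

and then

$$\mathrm{Var}(\hat\mu_{VSR})=\frac{\sigma^2}{Km}+\frac{1}{m^2}\sum_{h=1}^{m}\frac{1-\frac{n_h}{K}}{n_h}\,\sigma_h^2.$$

To prove the unbiasedness of $\widehat{\mathrm{Var}}(\hat\mu_{VSR})$, we use some ideas of MacEachern, Ozturk, Wolfe, and Stark (2002). First note that, from (6) and (7), in this design we have

$$\mathrm{Var}(\hat\mu_{VSR})=\frac{1}{m^2}\sum_{h=1}^{m}\frac{\sigma_h^2}{n_h}+\frac{1}{m^2K}\sum_{h=1}^{m}(\mu_h-\mu)^2$$

and

$$\begin{aligned}E\Big(\sum_{i\in s_h}\big(X_{[h]i}-\bar X_{[h]}\big)^2\Big)&=\sum_{i=1}^{K}E\big(X_{[h]i}^2\,I_{hi}\big)-n_h\,E\big(\bar X_{[h]}^2\big)\\&=\sum_{i=1}^{K}\frac{n_h}{K}\Big(\mathrm{Var}(X_{[h]i})+E^2(X_{[h]i})\Big)-n_h\Big(\mathrm{Var}(\bar X_{[h]})+E^2(\bar X_{[h]})\Big)\\&=n_h\big(\sigma_h^2+\mu_h^2\big)-n_h\Big(\frac{1-n_h/K}{n_h}\,\sigma_h^2+\frac{1}{K}\,\sigma_h^2+\mu_h^2\Big)=(n_h-1)\,\sigma_h^2,\end{aligned}$$

and also

$$\begin{aligned}E\Big(\sum_{h=1}^{m}\big(\bar X_{[h]}-\hat\mu_{VSR}\big)^2\Big)&=\sum_{h=1}^{m}E\big(\bar X_{[h]}^2\big)-m\,E\big(\hat\mu_{VSR}^2\big)\\&=\sum_{h=1}^{m}\Big(\frac{\sigma_h^2}{n_h}+\mu_h^2\Big)-\frac{1}{m}\sum_{h=1}^{m}\frac{\sigma_h^2}{n_h}-\frac{1}{mK}\sum_{h=1}^{m}(\mu_h-\mu)^2-m\mu^2\\&=\frac{m-1}{m}\sum_{h=1}^{m}\frac{\sigma_h^2}{n_h}+\frac{mK-1}{mK}\sum_{h=1}^{m}(\mu_h-\mu)^2,\end{aligned}$$

and then

$$E\big(\widehat{\mathrm{Var}}(\hat\mu_{VSR})\big)=\mathrm{Var}(\hat\mu_{VSR}).$$

Appendix B.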
Proof of Corollary 2

For the first part of Corollary 2, putting n in place of $n_h$ in (1) and using (6) and (7), it is easy to prove (2). For the second part, according to (3) we have (see David and Nagaraja 2003, p. 264)

$$\mu_h-\mu=\rho_{xy}\,(\mu_{(h)}-\mu), \tag{8}$$

and then it is enough to substitute (8) into (2).

Appendix C. Proof of Theorem 3

To prove Theorem 3, substituting (4) into (1) gives

$$\mathrm{Var}(\hat\mu_{NVSR})=\frac{1}{m^2}\left(\frac{\big(\sum_{h=1}^{m}\sigma_h\big)^2}{n_\bullet}-\frac{\sum_{h=1}^{m}\sigma_h^2}{K}\right)+\frac{\sigma^2}{Km},$$

and then, substituting $n_h=n_\bullet/m$ into (1) for VSR with equal allocation, we have

$$\mathrm{Var}(\hat\mu_{VSR})-\mathrm{Var}(\hat\mu_{NVSR})=\frac{1}{m\,n_\bullet}\left(\sum_{h=1}^{m}\sigma_h^2-\frac{\big(\sum_{h=1}^{m}\sigma_h\big)^2}{m}\right)=\frac{1}{m\,n_\bullet}\sum_{h=1}^{m}\left(\sigma_h-\frac{\sum_{h=1}^{m}\sigma_h}{m}\right)^2\ge 0.$$

References

Chen Z., Bai Z., Sinha B. (2004), Ranked Set Sampling: Theory and Applications. Lecture Notes in Statistics, New York: Springer.
David H. A., Nagaraja H. N. (2003), Order Statistics (3rd ed.), New York: Wiley.
Gardiner P. (1999), Chamomile (Matricaria recutita, Anthemis nobilis), Longwood Herbal Task Force Press, pp. 1–21.
Ghosh K., Tiwari R. C. (2008), “Estimating the Distribution Function Using L-Tuple Ranked Set Samples,” Journal of Statistical Planning and Inference, 138, 929–949.
Ghosh K., Tiwari R. C. (2009), “A Unified Approach to Variations of Ranked Set Sampling with Applications,” Journal of Nonparametric Statistics, 21, 471–485.
Kaur A., Patil G. P., Taillie C. A. (1997), “Unequal Allocation Models for Ranked Set Sampling with Skew Distributions,” Biometrics, 53, 123–130.
Letchamo W. (1993), Nitrogen Application Affects Yield and Content of the Active Substances in Chamomile Genotypes, New Crops. New York: Wiley, pp. 636–639.
MacEachern S. N., Ozturk O., Wolfe D. A., Stark G. V. (2002), “A New Ranked Set Sample Estimator of Variance,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64, 177–188.
McIntyre G. A.
(1952), “A Method of Unbiased Selective Sampling Using Ranked Sets,” Australian Journal of Agricultural Research, 3, 385–390.
Patil G. P., Sinha A. K., Taillie C. (1999), “Ranked Set Sampling: A Bibliography,” Environmental and Ecological Statistics, 6, 91–98.
Ridout M. S. (2003), “On Ranked Set Sampling for Multiple Characteristics,” Environmental and Ecological Statistics, 10, 225–262.
Samawi H. M. (1996), “Stratified Ranked Set Sample,” Pakistan Journal of Statistics, 12, 9–16.
Sarndal C. E., Swensson B., Wretman J. (1992), Model Assisted Survey Sampling, New York: Springer.
Wang Y.-G., Chen Z. H., Liu J. (2004), “General Ranked Set Sampling with Cost Considerations,” Biometrics, 60, 556–561.

© The Author 2017. Published by Oxford University Press on behalf of the American Association for Public Opinion Research. All rights reserved. For permissions, please email: journals.permissions@oup.com. This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices)

Journal

Journal of Survey Statistics and Methodology, Oxford University Press

Published: Sep 1, 2018
