
Error when specifying the correct model with svydesign, R survey package

I am sampling from a dataset I created myself, using a two-stage cluster sample. However, I cannot seem to specify the design the way I want without getting an error.

I have created a database based on information I have from census EA data from Zanzibar.

The data contains 2 districts. District 1 has 32 subunits (called Shehias) and District 2 has 29. In turn, each of the 61 Shehias has between 2 and 19 Enumeration Areas (EAs). The EAs themselves contain between 51 and 129 households.

The selection process is the following: all 2 districts and all 61 Shehias are included. In each Shehia, 2 EAs are selected at random. In each selected EA, 22 or 26 households (depending on the district) are selected. All household members are selected.

Hence this is a two-stage cluster sample. The Primary Sampling Units (PSUs) are the EAs, and the Secondary Sampling Units (SSUs) are the households. Both selections are at random.
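As a rough sketch of what the weights should look like under this scheme, each household's design weight is the product of the inverse selection probabilities at the two stages. The helper name `two_stage_weight` is hypothetical, and the 22 sampled households is an assumption for the EA shown in the data below (the question does not say which district uses 22 and which uses 26).

```r
# Hypothetical helper: design weight for a household under two-stage
# simple random sampling (EAs within a Shehia, then households within an EA).
two_stage_weight <- function(eas_in_shehia, eas_sampled, hh_in_ea, hh_sampled) {
  (eas_in_shehia / eas_sampled) * (hh_in_ea / hh_sampled)
}

# Shehia with 19 EAs (2 sampled), EA with 115 households.
# ASSUMPTION: 22 households sampled in this EA.
w <- two_stage_weight(eas_in_shehia = 19, eas_sampled = 2,
                      hh_in_ea = 115, hh_sampled = 22)
print(w)
```

Because all districts and all Shehias are included, they contribute a factor of 1 and can be left out of the product.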

These are the first six rows of the selected data called strategy_2:

      District_C Shehia_Code        EA_Code           HH_Number District_Numb District_Shehias Shehia_EAs HH_in_EA Prev_U3R3
    1          2        2_11 510201107001_1 510201107001_1_1165             1               29         19      115         0
    2          2        2_11 510201107001_1 510201107001_1_1165             1               29         19      115         0
    3          2        2_11 510201107001_1 510201107001_1_1165             1               29         19      115         0
    4          2        2_11 510201107001_1 510201107001_1_1165             1               29         19      115         0
    5          2        2_11 510201107001_1 510201107001_1_1165             1               29         19      115         0
    6          2        2_11 510201107001_1 510201107001_1_1173             1               29         19      115         1

If I spell out the whole process (including things as clusters that actually are not), then my design ought to be:

    strategy_2_Design <- svydesign(id =  ~ District_C    + Shehia_Code      + EA_Code    + HH_Number,
                                    fpc = ~ District_Numb + District_Shehias + Shehia_EAs + HH_in_EA,
                                    data = strategy_2)

Here I define the district and the number of districts in the survey, and likewise for Shehias. In both cases the sample equals the population, so the weight contribution is 1 at each of these stages. The third and fourth elements are the actual sampling units.

This design will give me a correct estimate (weights are correct) but the model only has one degree of freedom (2 districts – 1). Hence when I try to calculate values for subunits of Shehias through svyby it can calculate means but if I use svyciprop as FUN the confidence interval is NA because the degrees of freedom of the subset are 0.
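The degrees-of-freedom problem can be seen with simple arithmetic. The sketch below assumes the usual rule that design degrees of freedom are the number of first-stage units minus the number of first-stage strata; `design_df` is a hypothetical helper, not a function from the survey package.

```r
# Hypothetical helper: design degrees of freedom as
# (number of first-stage units) - (number of first-stage strata).
design_df <- function(n_psu, n_strata) n_psu - n_strata

# Districts declared as the first-stage units, no strata (one implicit stratum)
df_full <- design_df(n_psu = 2, n_strata = 1)

# A single Shehia sits inside one district, so the subset sees 1 unit
df_one_shehia <- design_df(n_psu = 1, n_strata = 1)

# EAs as PSUs, stratified by Shehia: 122 sampled EAs in 61 strata
df_stratified <- design_df(n_psu = 122, n_strata = 61)

print(c(full = df_full, one_shehia = df_one_shehia, stratified = df_stratified))
```

With districts as the first stage, any subset within one district is left with zero degrees of freedom, which is consistent with the NA confidence intervals from svyciprop.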

Trying to reduce the model down to the two stages I truly am using does not work. Namely

    strategy_2_Alt_1 <- svydesign(id =  ~ EA_Code     + HH_Number,
                   fpc = ~ Shehia_EAs  + HH_in_EA,
                   data = strategy_2)

yields:

    record 1 stage 1 : popsize= 19  sampsize= 122
    Error in as.fpc(fpc, strata, ids, pps = pps) : 
    FPC implies >100% sampling in some strata

Note that 19 is the number of EAs in the Shehia of the first record, while 122 is the number of EAs in the whole sample (2 for each of the 61 Shehias).
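The numbers in the message suggest what is being compared: without a strata argument the whole sample counts as a single stratum, so the stage-1 sample size is every sampled EA, set against each record's Shehia-level population size. A small base R sketch with made-up toy data (3 Shehias, 2 sampled EAs each) reproduces the comparison; the column `Shehia` and all values are assumptions for illustration only.

```r
# Toy data, all values made up: one row per sampled EA
toy <- data.frame(
  Shehia     = rep(c("A", "B", "C"), each = 2),
  EA_Code    = paste0("EA", 1:6),
  Shehia_EAs = rep(c(5, 4, 3), each = 2),  # EAs in the population, per Shehia
  stringsAsFactors = FALSE
)

# No strata: the stage-1 sample size is counted over the whole sample
n_sampled_all <- length(unique(toy$EA_Code))
over_sampled_unstratified <- any(n_sampled_all > toy$Shehia_EAs)

# Stratified by Shehia: the sample size is counted within each Shehia
n_by_shehia   <- tapply(toy$EA_Code, toy$Shehia, function(x) length(unique(x)))
pop_by_shehia <- tapply(toy$Shehia_EAs, toy$Shehia, max)
over_sampled_stratified <- any(n_by_shehia > pop_by_shehia)

print(c(unstratified = over_sampled_unstratified,
        stratified   = over_sampled_stratified))
```

In the toy data the unstratified comparison implies more than 100% sampling (6 sampled EAs against populations of 5, 4 and 3), while the stratified one does not.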

One way around could be to claim that EAs were stratified by Shehia. This would be:

    strategy_2_Alt_2 <- svydesign(id =  ~ EA_Code     + HH_Number,
                    fpc = ~ Shehia_EAs  + HH_in_EA,
                    strata = ~ Shehias_Cat + NULL,
                    data = strategy_2)

Shehias_Cat simply contains the name of the Shehia each EA is in. This gives a stratified two-level cluster sampling design with (122, 2916) clusters. The weights here are the same as in the first design (strategy_2_Design):

    > identical(weights(strategy_2_Design),weights(strategy_2_Alt_2))
    [1] TRUE

Hence if I calculate the mean using the weights by hand I get the same result. However, if I try to use svymean to do this calculation, I get an error:

    > svymean(~Prev_U3R3, strategy_2_Alt_2)
    Error in v.sub[[i]] : subscript out of bounds
    In addition: Warning message:
    In by.default(1:n, list(as.numeric(clusters[, 1])), function(index) { :
    NAs introduced by coercion

So my questions are 1) where do these errors come from and 2) how do I define my model correctly? I have been trying to think about this many a way but do not seem to get it right.

The data and the code I used to get to this issue are available at https://www.dropbox.com/sh/u1ajzxaxgue57r8/AAAkCfPC2YrwhEq6gbLsQmGQa?dl=0 .

I think you want

    strategy_2_SHORT_Design <- svydesign(id =     ~ factor(EA_Code) + HH_Number,
                                         fpc =    ~ Shehia_EAs + HH_in_EA,
                                         strata = ~ Shehias_Cat,
                                         data = strategy_2)

The design has households sampled within EAs, within strata defined by Shehias. The population number of EAs in each stratum is given by Shehia_EAs, and the population number of households in each EA is given by HH_in_EA. In your data, EA_Code was a character variable, but it has to be numeric or a factor; that is what the "NAs introduced by coercion" warning is pointing at.
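The coercion problem is easy to reproduce in base R. The warning in the question mentions `as.numeric(clusters[, 1])`; the sketch below only shows what `as.numeric` does to an ID string like the ones in the data, and does not reproduce the package internals.

```r
# An EA code as it appears in the data: a character string, not a number
ea_chr <- c("510201107001_1", "510201107001_1", "510201107002_3")

# Coercing the strings directly gives NA (with a coercion warning)
ea_num_bad <- suppressWarnings(as.numeric(ea_chr))

# Going through factor() gives usable integer codes, one per distinct EA
ea_num_ok <- as.numeric(factor(ea_chr))

print(ea_num_bad)
print(ea_num_ok)
```

A quick `class(strategy_2$EA_Code)` before building the design shows whether the conversion is needed.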

The documentation for svydesign should make this clear, but doesn't, presumably because of the default conversion of strings to factors back in primitive times when the function was written.
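For anyone who wants to try this without the original data, here is a minimal sketch on made-up toy data with the same column names (3 Shehias, 2 sampled EAs each, 4 sampled households per EA, one row per household). All values are invented, and it assumes the survey package is installed.

```r
library(survey)

set.seed(42)

# Toy data, all values made up
shehia <- rep(c("S1", "S2", "S3"), each = 8)
ea     <- paste0(shehia, "_EA", rep(rep(1:2, each = 4), times = 3))

toy <- data.frame(
  Shehias_Cat = factor(shehia),
  EA_Code     = ea,                                   # character, as in the question
  HH_Number   = paste0(ea, "_HH", rep(1:4, times = 6)),
  Shehia_EAs  = rep(c(5, 8, 3), each = 8),            # EAs per Shehia (population)
  HH_in_EA    = rep(c(60, 75, 90, 55, 100, 70), each = 4),  # households per EA
  Prev_U3R3   = rbinom(24, 1, 0.3),
  stringsAsFactors = FALSE
)

# Same specification as above: EAs within Shehia strata, households within EAs
toy_design <- svydesign(id     = ~ factor(EA_Code) + HH_Number,
                        fpc    = ~ Shehia_EAs + HH_in_EA,
                        strata = ~ Shehias_Cat,
                        data   = toy)

toy_mean <- svymean(~Prev_U3R3, toy_design)
print(toy_mean)
print(weights(toy_design)[1])  # should equal (5 / 2) * (60 / 4) for the first row
```

The first weight can be checked by hand against the two-stage selection probabilities, which is a useful sanity check before running svyby on the real data.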
