Cluster sampling with R or SAS

I have 155k points distributed in 2k groups. There are 3 kinds of points (A + B + C = #points).

Frequency distribution:

  Gr #clients    #A    #B    #C
-------------------------------
  01      100    80    10    10
  02       10     0     3     7

2000      400   300    80    20
--------------------------------
TOTAL: 155000 93000 46500 15500

I want to select random groups of points, up to a total of 6,000 points, such that the proportion of each type of point in the sample is the same as in the population.

Is there a method for this in R or SAS? Or should I draw a simple random sample and then design some group-substitution algorithm until I get a balanced sample?

EXAMPLE 1: THIS IS HOW I WOULD DO IT IN SAS. If code makes you nervous, use the simpler method in EXAMPLE 2, below.

Note: What you're describing sounds like a proportional sample, not a cluster sample, so that's what I've shown here. Hope that meets your needs.

    /******** sort by strata (and by clients within each stratum) *****/
    proc sort data=MED_pts_155k ; by GRoup A_B_C clients ; run ;

    /******** create sample design ***/
    proc surveyselect noprint
      data = MED_pts_155k
      method = srs
      seed = 7
      n = 6000
      out = sample_design ;
      strata GRoup A_B_C /
        alloc = prop nosample
        allocmin = 2 ; /*** min of 2 per stratum ****/
    run ;

    /******** pull sample **********/
    proc surveyselect noprint
      data = MED_pts_155k
      method = sys
      seed = 7             /* same fixed seed as the design step */
      n = sample_design    /* per-stratum sample sizes from the design step */
      out = MY_SAMPLE ;
      strata GRoup A_B_C ;
    run ;

The "alloc = prop" option gives you proportional (i.e. "even") sampling: each stratum receives a share of the sample equal to its share of the population. The "nosample" option in SAS lets you generate a separate file outlining the sample design. You then use that design in a second stage, where you actually pull the sample. If this is too much bother, you can leave off the "nosample" option and go straight to pulling your sample, as in the simpler Example 2.
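
To make the proportional allocation concrete, here is a small R sketch of the arithmetic, using the three type totals from the TOTAL row of the question's table. It only illustrates the idea of allocating by population share; it is not a reproduction of what PROC SURVEYSELECT does internally, and the names `pop`, `n_total` and `alloc` are made up for this example.

```r
# Population counts per point type, from the TOTAL row of the question's table
pop <- c(A = 93000, B = 46500, C = 15500)

# Desired overall sample size
n_total <- 6000

# Proportional allocation: each type gets its population share of the sample
alloc <- n_total * pop / sum(pop)
alloc
# A = 3600, B = 1800, C = 600
```

With real strata the shares rarely come out as whole numbers, so some rounding rule is needed and the rounded sizes may not add up to exactly 6,000.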

Note that in the second step above we've chosen to switch to 'method = SYS' instead of a simple random sample (SRS). SRS would work too, but since you may have different types of responses by client, you might want to sample in a representative way across the range of clients. To do that, you sort within each stratum by client and intentionally sample in even increments across the range of clients; this is called a "systematic" sample (SYS).
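
The idea of sampling in even increments can be sketched in a few lines of base R. This is one common way to draw a systematic sample (a random start followed by a fixed step), written as an illustration only; it is not claimed to match SAS's implementation of method=SYS. The helper `systematic_idx` and the `clients` vector are hypothetical.

```r
# Hypothetical helper: pick n row positions out of N rows that are already sorted
systematic_idx <- function(N, n) {
  stopifnot(n >= 1, n <= N)
  k <- N / n                            # sampling interval (may be fractional)
  start <- runif(1, min = 0, max = k)   # random start inside the first interval
  idx <- floor(start + k * (0:(n - 1))) + 1
  pmin(idx, N)                          # guard against floating-point overshoot
}

set.seed(7)
clients <- sort(sample(1:1000, 50))     # made-up client values within one stratum, sorted
picked <- systematic_idx(length(clients), 5)
clients[picked]                         # 5 clients spread evenly across the sorted range
```

Because the rows are sorted before the draw, the selected positions are spread over the whole range of client values instead of possibly bunching up, which is the point of choosing SYS over SRS here.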

EXAMPLE 2: SIMPLER

You could also do it all in one simple step if you want less code and don't need to see the sample design broken down in a separate file.

/******** sort by strata *****/
proc sort data=MED_pts_155k ; by GRoup A_B_C ; run ;

/******** pull sample **********/
proc surveyselect noprint
  data= MED_pts_155k
  method= SRS
  seed = 7 
  n = 6000
  out = MY_SAMPLE ;
 strata GRoup A_B_C  / 
    alloc=prop 
    allocmin = 2  ; 
run ;

In both examples we're assuming you have two stratification variables: 'GRoup' and a second variable 'A_B_C', which contains values of a, b, or c. Hope that helps. Cluster sampling is possible in SAS as well, but as noted above, I've illustrated a proportional sample here since that seems to be what you need. Cluster sampling would take a little more space to describe.

i don't understand your fake data, so i'll make my own.

i'm assuming you construct your own unique groups. i've just used the numbers 1:2000 to do it, but you can run this code on any group type.

# let's make some fake data with 155k points distributed in 2k groups
x <- 
    data.frame(
        groupname = sample( x = 1:2000 , size = 155000 , replace = TRUE ) ,
        anothercol = 1 ,
        andanothercol = "hi"
    )

# look at your data frame `x`
# so long as you've constructed a `groupname` variable in your data, it's easy
head( x )

# calculate the proportion of each group in the total
# and store that into a probability vector
groupwise.prob <- table( x$groupname ) / nrow( x )

# convert this to a data frame
prob.frame <- data.frame( groupwise.prob )

head( prob.frame )

# rename the `Var1` column to match your group name variable on `x`
names( prob.frame )[ 1 ] <- 'groupname'

# rename the `Freq` column to say what it is on `x`
names( prob.frame )[ 2 ] <- 'prob'

# merge these individual probabilities back onto your data frame
x <- merge( x , prob.frame , all.x = TRUE )

# now just use the sample function's prob= parameter off of that
# and scale down the size to what you want
recs.to.samp <-
    sample( 
        1:nrow( x ) , 
        size = 6000 , 
        replace = FALSE , 
        prob = x$prob 
    )

# and now here's your new sample, with proportions intact
y <- x[ recs.to.samp , ]

head( y )
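
The fake data above has no column for the point type (A, B or C), so it cannot show whether the type proportions survive the draw. Note also that weighting each row by its group's share makes rows in larger groups individually more likely to be picked. As a rough alternative sketch, assuming your data has a `type` column (made up here, along with `pts`, `quota` and `samp`), you can draw a fixed quota of points from each type and then compare the proportions directly. Like the SAS examples, this samples individual points rather than whole groups.

```r
set.seed(7)

# Made-up data: 155k points with a group id and a type in roughly 60/30/10 shares
pts <- data.frame(
  groupname = sample(1:2000, size = 155000, replace = TRUE),
  type = sample(c("A", "B", "C"), size = 155000, replace = TRUE,
                prob = c(0.6, 0.3, 0.1))
)

n_total <- 6000

# Quota per type, proportional to its population count
# (rounding means the quotas may sum to slightly more or less than n_total)
counts <- table(pts$type)
quota <- setNames(round(n_total * as.vector(counts) / nrow(pts)), names(counts))

# Draw each type's quota without replacement and stack the row numbers
rows_by_type <- split(seq_len(nrow(pts)), pts$type)
picked <- unlist(lapply(names(rows_by_type), function(ty) {
  sample(rows_by_type[[ty]], size = quota[[ty]])
}))
samp <- pts[picked, ]

# Compare type proportions in the population and in the sample
round(prop.table(table(pts$type)), 3)
round(prop.table(table(samp$type)), 3)
```

The two tables printed at the end should agree to about three decimal places, since the only discrepancy comes from rounding the quotas.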
