
Random sampling with different strata

I have a dataset from which I want to select a random sample of rows while following some predefined rules. This may be a very basic question, but I am very new to this and still trying to grasp the basic concepts. My dataset has some 330 rows (I have included a simplified version here) and several columns. I want to sample 50 of the 330 rows (I kept these numbers in the mock dataset because they are part of the problem I am having), with the option to add predefined rules to the sampling process. Here is a mock version of the data:

bank<-data.frame(matrix(0,nrow=330,ncol=5))
colnames(bank)<-c("id","var1","var2","year","lo")
bank$id<-c(1:330)
bank$var1<-sample(letters[1:5],330,replace=T)
bank$var2<-sample(c("s","r"),330,replace=T)
bank$year<-sample(2010:2018,330,replace=T)
bank$lo<-sample(c("lo1","lo2","lo3","lo4","lo5","lo6"),330,replace=T)

The code I used to try to sample the correct number of rows is:

library(splitstackshape)
x<-splitstackshape::stratified(indt=bank,group=c("var1","var2","year","lo"),0.151)

However, this does not select 50 rows. I had initially tried size=50, but I got the following error:

Groups b s 2012 lo4,... [there is a very long list here],...contain fewer rows than requested. Returning all rows.

Then I tried to define size as a proportion: 0.151 (15.1%), which should give about 50 out of 330, but that samples only 5 rows. (With 0.5 it samples 44 rows, and with 0.500000001 it samples 287 rows???)

What am I missing? For the moment I am stuck here.

Once I manage to sample the correct number of rows (50), I would like to define some rules, such as: at most 50% of the sample can have bank$year == 2018, and at most half of those 2018 rows can have bank$var2 == "r". Obviously I don't expect someone to do this for me, but could you please provide some advice on:

1- Why am I getting the wrong number of rows (probably just syntax)?
2- Which package should I look into if splitstackshape::stratified() is not a good choice for this?

Many thanks! M

I think the issue comes from the fact that your dataset (as you've shared it here) is fairly small while the number of strata is large: 5 letters × 2 (s or r) × 9 years × 6 lo categories = 540 possible strata for only 330 rows. It's just not possible to take samples of the desired size from within each stratum. When I bump the dataset up to 33,000 rows and take a 15.1% sample, I get a sample of size 4,994. Putting size = 50 requests a sample of 50 from each stratum, which is not remotely possible with the data you've shared.
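To make the odd totals in the question (5, 44 and 287 rows) less mysterious, here is a rough base-R sketch. It assumes that each stratum contributes round(rows in stratum × size) rows, which is consistent with the counts reported but is an assumption about the function's internals rather than something checked against its source. The helper expected_total and the seed are made up for illustration, and the mock data uses a year column as the question intends.

```r
# Sketch: why a fractional size gives surprising totals with many tiny strata.
# Assumption: each stratum contributes round(rows_in_stratum * size) rows.
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
bank <- data.frame(
  id   = 1:330,
  var1 = sample(letters[1:5], 330, replace = TRUE),
  var2 = sample(c("s", "r"), 330, replace = TRUE),
  year = sample(2010:2018, 330, replace = TRUE),
  lo   = sample(paste0("lo", 1:6), 330, replace = TRUE)
)

n_possible <- 5 * 2 * 9 * 6  # 540 possible strata for only 330 rows

# Number of rows in each stratum that actually occurs in the data
stratum_size <- table(interaction(bank$var1, bank$var2, bank$year, bank$lo,
                                  drop = TRUE))

# Total sample size implied by a fraction, under the rounding assumption
expected_total <- function(frac) sum(round(stratum_size * frac))

expected_total(0.151)        # most strata have 1 to 3 rows and round down to 0
expected_total(0.5)          # round(0.5) is 0 in R, so single-row strata vanish
expected_total(0.500000001)  # now every non-empty stratum gives at least 1 row
```

Under that assumption, the jump between 0.5 and 0.500000001 is just R rounding 0.5 down to 0 for every single-row stratum, not a syntax problem.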

> bank<-data.frame(matrix(0,nrow=33000,ncol=5))
> colnames(bank)<-c("id","var1","var2","year","lo")
> bank$id<-c(1:33000)
> bank$var1<-sample(letters[1:5],33000,replace=T)
> bank$var2<-sample(c("s","r"),33000,replace=T)
> bank$var3<-sample(2010:2018,33000,replace=T)
> bank$lo<-sample(c("lo1","lo2","lo3","lo4","lo5","lo6"),33000,replace=T)
> 
> k <- stratified(bank, group = c('var1', 'var2', 'var3', 'lo'), size = .151)
> dim(k)
[1] 4994    6
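Since stratifying on all four variables cannot return 50 rows from 330, one possible way to handle the rules from the question is to draw a plain random sample of 50 and redraw until the rules hold (rejection sampling). This is only a sketch: draw_with_rules and max_tries are hypothetical names, not part of any package, and the mock data is rebuilt here with a year column.

```r
# Sketch: draw n rows at random and redraw until the rules are satisfied.
# draw_with_rules() and max_tries are illustrative names, not a package API.
set.seed(42)  # arbitrary seed, only so the sketch is reproducible
bank <- data.frame(
  id   = 1:330,
  var1 = sample(letters[1:5], 330, replace = TRUE),
  var2 = sample(c("s", "r"), 330, replace = TRUE),
  year = sample(2010:2018, 330, replace = TRUE),
  lo   = sample(paste0("lo", 1:6), 330, replace = TRUE)
)

draw_with_rules <- function(df, n = 50, max_tries = 1000) {
  for (i in seq_len(max_tries)) {
    s <- df[sample(nrow(df), n), ]          # sample rows without replacement
    is_2018  <- s$year == 2018
    n_2018   <- sum(is_2018)
    n_2018_r <- sum(is_2018 & s$var2 == "r")
    # Rule 1: at most 50% of the sample from 2018
    # Rule 2: at most half of the 2018 rows have var2 == "r"
    if (n_2018 <= 0.5 * n && n_2018_r <= 0.5 * n_2018) return(s)
  }
  stop("no sample satisfying the rules found in ", max_tries, " tries")
}

samp <- draw_with_rules(bank)
nrow(samp)
```

Redrawing is cheap when the rules are usually met, as with this mock data. If the rules were rarely satisfied, sampling each part of the data separately with fixed quotas would probably be a better fit.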

Another approach, shared by Jenny Bryan, is to sample within groups using a group-specific n. Here n is the number of rows to draw from each group and samp is the random sample drawn from that group, so n needs to be adjusted to each group's proportionate share of the total:

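library(dplyr)
library(tidyr)
library(purrr)

bank %>% 
  group_by(var1) %>% 
  nest() %>% 
  ungroup() %>%                      # recent tidyr keeps the grouping after nest()
  mutate(n = c(7,0,9,1,13),          # one n per group, in the row order of the nested data
         samp = map2(data, n, sample_n)) %>% 
  select(var1, samp) %>% 
  unnest(samp)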
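The per-group n values do not have to be picked by hand. A minimal base-R sketch of proportional allocation follows, using largest-remainder rounding so that the counts add up to exactly 50. The variable names (total, quota, short, extra) and the seed are illustrative, and only var1 is used as the grouping variable.

```r
# Sketch: proportional allocation of 50 rows across the var1 groups.
set.seed(7)  # arbitrary seed, only so the sketch is reproducible
bank <- data.frame(
  id   = 1:330,
  var1 = sample(letters[1:5], 330, replace = TRUE)
)

total  <- 50
counts <- table(bank$var1)                          # rows available per group
quota  <- as.numeric(counts) * total / sum(counts)  # ideal fractional share
n      <- floor(quota)

# Give the rows lost to flooring to the groups with the largest remainders
short <- total - sum(n)
extra <- order(quota - n, decreasing = TRUE)[seq_len(short)]
n[extra] <- n[extra] + 1
names(n) <- names(counts)

# Draw n[g] rows from each group g
idx <- unlist(lapply(names(n), function(g) {
  rows <- which(bank$var1 == g)
  rows[sample.int(length(rows), n[[g]])]
}))
samp <- bank[idx, ]

n           # rows taken per group
nrow(samp)  # should equal total
```

The resulting n vector could be passed to the nested sample_n approach, provided its order matches the order of the groups in the nested data.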
