Random sampling across different strata

Question

I have a dataset from which I want to select a random sample of rows, but following some pre-defined rules. This may be a very basic question, but I am very new to this and still trying to grasp the basic concepts. My dataset includes some 330 rows of data (I have included a simplified version here) with several columns. I want to sample 50 rows out of the 330 (I kept these numbers in the mock dataset for simplicity, as this is part of the problem I am having) with the option to add the predefined rules to the sampling process. Here is a mock version of the data:

bank <- data.frame(matrix(0, nrow = 330, ncol = 5))
colnames(bank) <- c("id", "var1", "var2", "year", "lo")
bank$id <- 1:330
bank$var1 <- sample(letters[1:5], 330, replace = TRUE)
bank$var2 <- sample(c("s", "r"), 330, replace = TRUE)
bank$year <- sample(2010:2018, 330, replace = TRUE)
bank$lo <- sample(c("lo1", "lo2", "lo3", "lo4", "lo5", "lo6"), 330, replace = TRUE)

The code I used to try to sample the correct number of rows is:

library(splitstackshape)
x <- stratified(indt = bank, group = c("var1", "var2", "year", "lo"), size = 0.151)

However, this is not selecting 50 rows. I had initially tried to define size=50 but I got the following error:

Groups b s 2012 lo4,... [there is a very long list here],...contain fewer rows than requested. Returning all rows.

Then I tried to define size as a fraction: 0.151 (15.1%?), which should be about right for 50 out of 330, but that samples 5 rows (I tried 0.5 and it samples 44 rows, and if I try 0.500000001 it samples 287 rows???).

What am I missing? For the moment I am stuck here.

Once I manage to sample the correct number of rows (50) I would like to define some rules, like: only up to 50% of the sample can have 2018 (bank$year) AND only up to half of the bank$year==2018 rows can have bank$var2=="r". Obviously I don't expect someone to do this for me, but could you please provide some advice on:

1- Why am I getting the wrong number of rows (probably just syntax?)
2- What package should I look into if splitstackshape::stratified() is not the best or a good choice to achieve this?

Many thanks! M

Answer 1

I think the issue comes from the fact that your dataset (as you've shared here) is fairly small, you have a large number of strata (5 letters × 2 values of var2 (s or r) × 9 years × 6 lo categories = 540 possible strata), and it's just not possible to take samples of the desired size from within each stratum. With 330 rows spread over that many strata, most strata hold only one or two rows, or none at all. When I bump the dataset size up to 33,000 rows and take a sample of 15.1%, I get a sample of size 4,994. Putting size = 50 is requesting a sample of size 50 from each stratum, which is not remotely possible with the data you've shared.

> bank<-data.frame(matrix(0,nrow=33000,ncol=5))
> colnames(bank)<-c("id","var1","var2","year","lo")
> bank$id<-c(1:33000)
> bank$var1<-sample(letters[1:5],33000,replace=T)
> bank$var2<-sample(c("s","r"),33000,replace=T)
> bank$var3<-sample(2010:2018,33000,replace=T)
> bank$lo<-sample(c("lo1","lo2","lo3","lo4","lo5","lo6"),33000,replace=T)
> 
> k <- stratified(bank, group = c('var1', 'var2', 'var3', 'lo'), size = .151)
> dim(k)
[1] 4994    6
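
To make the stratum-size point concrete, here is a rough sketch in base R that counts the rows in each non-empty stratum of a 330-row mock dataset and then estimates how many rows a given fraction would return. It assumes that a fractional size is applied within each stratum and rounded to the nearest whole number with round(); I believe that is how stratified() behaves, but treat it as an assumption rather than a statement about the package's internals. The helper name rows_sampled and the seed are made up for the illustration.

```r
set.seed(1)  # arbitrary seed, only so the sketch is repeatable

# Mock data shaped like the question's: 330 rows, 5 x 2 x 9 x 6 = 540 possible strata
bank <- data.frame(
  id   = 1:330,
  var1 = sample(letters[1:5], 330, replace = TRUE),
  var2 = sample(c("s", "r"), 330, replace = TRUE),
  year = sample(2010:2018, 330, replace = TRUE),
  lo   = sample(paste0("lo", 1:6), 330, replace = TRUE)
)

# Number of rows in each non-empty stratum
sizes <- as.vector(table(interaction(bank$var1, bank$var2, bank$year, bank$lo,
                                     drop = TRUE)))
length(sizes)  # how many strata actually occur in the data
table(sizes)   # how many strata hold 1, 2, 3, ... rows

# ASSUMPTION: rows taken from a stratum = round(stratum size * fraction).
# rows_sampled() is a hypothetical helper, not part of splitstackshape.
rows_sampled <- function(fraction) sum(round(sizes * fraction))

rows_sampled(0.151)        # tiny strata mostly round down to 0 rows
rows_sampled(0.5)          # round(0.5) is 0 in R, so one-row strata give nothing
rows_sampled(0.500000001)  # now every one-row stratum rounds up to 1 row
```

If that assumption holds, it would also explain the jump between 0.5 and 0.500000001 in the question: R rounds exact halves to the even digit, so a one-row stratum contributes nothing at 0.5 but a full row at anything slightly above it.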

Answer 2

Another approach, described by Jenny Bryan, is to specify the number of rows n to draw from each group. Here n holds the sample size requested for each group and samp holds the random sample drawn from that group, so n will need to be adjusted according to the proportionate amount per group:

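library(dplyr)
library(tidyr)
library(purrr)

bank %>% 
  group_by(var1) %>% 
  nest() %>% 
  ungroup() %>%                    # drop grouping so mutate() sees all five rows at once
  mutate(n = c(7, 0, 9, 1, 13),    # rows to draw from each var1 group, in row order
         samp = map2(data, n, sample_n)) %>% 
  select(var1, samp) %>% 
  unnest(samp)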
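
One way to pick the per-group n so that the total comes out at exactly 50 is proportional allocation with a largest-remainder correction. The sketch below is only an illustration in base R, not part of the approach above: it stratifies on var1 alone, and the names total, exact and samp are chosen for the example.

```r
set.seed(42)  # arbitrary seed, only so the sketch is repeatable

# Mock data with the same var1 column as in the question
bank <- data.frame(
  id   = 1:330,
  var1 = sample(letters[1:5], 330, replace = TRUE)
)

total  <- 50                 # desired overall sample size
counts <- table(bank$var1)   # rows available in each var1 group

# Proportional share of the 50 rows for each group (usually not whole numbers)
exact <- as.numeric(counts) * total / sum(counts)

# Largest-remainder rounding: take the floor, then hand the leftover rows
# to the groups with the largest fractional parts so the sizes sum to 50
n <- floor(exact)
leftover <- total - sum(n)
if (leftover > 0) {
  top <- order(exact - n, decreasing = TRUE)[seq_len(leftover)]
  n[top] <- n[top] + 1
}
names(n) <- names(counts)

# Draw n[[g]] rows without replacement from each group and stack the results
samp <- do.call(rbind, lapply(names(n), function(g) {
  rows <- bank[bank$var1 == g, ]
  rows[sample(nrow(rows), n[[g]]), ]
}))

n           # rows drawn per group
nrow(samp)  # total rows drawn
```

The same n vector could be passed to the mutate() step above in place of the hard-coded values, provided its order matches the row order of the nested data frame.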
