
Sample a data set of 10,000 rows into unique sets of 100 based on probability of a particular column value

I have a dataset of 20 rows with 4 columns (ID, Name, Age, Type). This is a simplified data set.

Original data set:

>data
ID Name Age Type
1  ABC  23   A
2  CDE  34   A
3  ABCE  23   C
4  CDEYU  34   B 
5  ABCW  23   A
6  CDEDR  34   B 
7  ASER  23   A
8  CDEAW  34   B 
9  ABCHKJ  23   A
10  CDEFDE  34   C 
11  ABCDDD  23   A
12  CDEDDD  34   A
13  ABCEDDD  23   C
14  CDEYUDDD  34   B 
15  ABCWDDD  23   A
16  CDEDRDDD  34   B 
17  ASERDDD  23   A
18  CDEAWDDD  34   B 
19  ABCHKJDDD  23   A    
20  CDEFDEDDD  34   C 

Here the "Type" column is distributed so that the proportions of A, B and C are 0.5, 0.3 and 0.2 respectively.

Now, I want to split this into two non-overlapping sets of 10 rows each, so that each set has the same distribution of Type.

Can I use the sample function to achieve this?

Something like this:

sample(data, 10, replace=F, prob((data$Type="A")=0.5,(data$Type="B")=0.3,(data$Type="C")=0.2))

Also, how do I write a loop to do this repeatedly for a bigger set of 100 rows? I mean 10 sets from a dataset of 100 rows.

Expected output:

Dataset 1:

ID Name Age Type
1  ABC  23   A
2  CDE  34   A
3  ABCE  23   C
4  CDEYU  34   B 
5  ABCW  23   A
6  CDEDR  34   B 
7  ASER  23   A
8  CDEAW  34   B 
9  ABCHKJ  23   A
10  CDEFDE  34   C 

Dataset 2:

ID Name Age Type
1  ABCDDD  23   A
2  CDEDDD  34   A
3  ABCEDDD  23   C
4  CDEYUDDD  34   B 
5  ABCWDDD  23   A
6  CDEDRDDD  34   B 
7  ASERDDD  23   A
8  CDEAWDDD  34   B 
9  ABCHKJDDD  23   A
10  CDEFDEDDD  34   C 

Any help in this regard would be greatly appreciated.

Here is one way to achieve what I believe you intend to do:

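# Example data: 100 rows with 50 A, 30 B and 20 C in random order
d <- data.frame(id=1:100,
                type=sample(rep(c('A', 'B', 'C'), c(50, 30, 20))),
                group=NA)

# Within each type, hand out the group numbers 1 to 10 in random order,
# repeated so that every group receives 5 A, 3 B and 2 C
d <- within(d, {
  group[which(type=='A')] <- sample(gl(10, 5))
  group[which(type=='B')] <- sample(gl(10, 3))
  group[which(type=='C')] <- sample(gl(10, 2))
})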


foo <- split(d[, 1:2], d$group) 
# above, adjust 1:2 to reflect which columns you want 
#  to include in the split data.frames.

foo[1:2] # First 2 (of 10) elements

$`1`
     id type
20   20    A
31   31    C
34   34    C
37   37    A
42   42    A
52   52    B
60   60    A
74   74    B
77   77    A
100 100    B

$`2`
   id type
1   1    C
17 17    C
27 27    A
46 46    B
57 57    B
58 58    A
62 62    B
71 71    A
72 72    A
89 89    A

Each element of the list foo has 5 x A, 3 x B, and 2 x C. This is achieved by identifying the indices corresponding to each type in turn (using which), then assigning permuted group numbers 1 through 10 to those rows (with the number of repetitions corresponding to your desired distribution). Finally, split is used to split the data.frame into a list of data.frames.
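One way to confirm the composition of every subset is to cross-tabulate group against type. The sketch below rebuilds a 100-row data frame like d so that it runs on its own; the fixed seed and the name counts are assumptions added for illustration, not part of the original answer.

```r
set.seed(1)  # assumption: fixed seed only so the run is repeatable

# Rebuild a 100-row example with 50 A, 30 B and 20 C in random order
d <- data.frame(id = 1:100,
                type = sample(rep(c("A", "B", "C"), times = c(50, 30, 20))),
                group = NA)

# Same assignment idea as above: shuffle the group labels within each type
d <- within(d, {
  group[which(type == "A")] <- sample(gl(10, 5))
  group[which(type == "B")] <- sample(gl(10, 3))
  group[which(type == "C")] <- sample(gl(10, 2))
})

# Rows are groups, columns are types
counts <- table(d$group, d$type)
print(counts)
```

If the assignment worked, every row of the table shows the same three counts, one per type.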

To generalise this solution to a dataset with 10,000 rows, with 100 rows in each subset, simply adjust the arguments to gl, e.g. group[which(type=='A')] <- sample(gl(100, 50)) (if there are 5000 A in the large dataset).
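Rather than typing the gl arguments by hand for each type, the repetition counts can be derived from the data. The following is a minimal sketch of that idea, not the original answer's code: split_stratified, its arguments, and the example data frame big are hypothetical names, and the function assumes every type count divides evenly by the number of sets.

```r
# Hypothetical helper: split `data` into `n_sets` subsets that all share
# the same mix of values in the column named `type_col`
split_stratified <- function(data, type_col, n_sets) {
  counts <- table(data[[type_col]])
  # Assumption: exact balance is only possible if each count divides evenly
  stopifnot(all(counts %% n_sets == 0))
  group <- integer(nrow(data))
  for (lev in names(counts)) {
    idx <- which(data[[type_col]] == lev)
    # Each set number appears counts / n_sets times, in random order
    group[idx] <- sample(rep(seq_len(n_sets), each = counts[[lev]] / n_sets))
  }
  split(data, group)
}

set.seed(42)  # assumption: fixed seed only so the run is repeatable
big <- data.frame(id = 1:10000,
                  type = sample(rep(c("A", "B", "C"),
                                    times = c(5000, 3000, 2000))))
sets <- split_stratified(big, "type", n_sets = 100)

length(sets)            # number of subsets
table(sets[[1]]$type)   # composition of the first subset
```

Because every row receives exactly one group number, the subsets do not overlap, and no explicit loop over subsets is needed.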
