Sample a 10,000-row dataset into 100 unique datasets based on the probabilities of a specific column's values

Question

I have a dataset of 20 rows with 4 columns (ID, Name, Age, Type). [simplified data set]

Original data set:

>data
ID Name Age Type
1  ABC  23   A
2  CDE  34   A
3  ABCE  23   C
4  CDEYU  34   B 
5  ABCW  23   A
6  CDEDR  34   B 
7  ASER  23   A
8  CDEAW  34   B 
9  ABCHKJ  23   A
10  CDEFDE  34   C 
11  ABCDDD  23   A
12  CDEDDD  34   A
13  ABCEDDD  23   C
14  CDEYUDDD  34   B 
15  ABCWDDD  23   A
16  CDEDRDDD  34   B 
17  ASERDDD  23   A
18  CDEAWDDD  34   B 
19  ABCHKJDDD  23   A    
20  CDEFDEDDD  34   C

Here the "Type" column is distributed such that the probabilities of A, B, C are (0.5, 0.3, 0.2) respectively.

Now I want to cut two unique (non-overlapping) sets of 10 rows each, so that each set has the same probability distribution of Type.

Can I use the sample function to achieve this?

Something like this:

sample(data, 10, replace=F, prob((data$Type="A")=0.5,(data$Type="B")=0.3,(data$Type="C")=0.2))

Also, how do I write a loop to do this repeatedly for a bigger set of 100 rows? I mean 10 sets from a dataset of 100 rows.

Expected Output:

Dataset 1:

ID Name Age Type
1  ABC  23   A
2  CDE  34   A
3  ABCE  23   C
4  CDEYU  34   B 
5  ABCW  23   A
6  CDEDR  34   B 
7  ASER  23   A
8  CDEAW  34   B 
9  ABCHKJ  23   A
10  CDEFDE  34   C

Dataset 2:

ID Name Age Type
1  ABCDDD  23   A
2  CDEDDD  34   A
3  ABCEDDD  23   C
4  CDEYUDDD  34   B 
5  ABCWDDD  23   A
6  CDEDRDDD  34   B 
7  ASERDDD  23   A
8  CDEAWDDD  34   B 
9  ABCHKJDDD  23   A
10  CDEFDEDDD  34   C

Any help in this regard would be greatly appreciated.

Answer 1

Here is one way to achieve what I believe you intend to do:

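# Toy data: 100 rows with 50 A, 30 B and 20 C in random order.
d <- data.frame(id=1:100,
                type=sample(rep(c('A', 'B', 'C'), c(50, 30, 20))),
                group=NA)

# Within each type, hand out the group numbers 1-10 in random order,
# so that every group receives 5 A, 3 B and 2 C.
d <- within(d, {
  group[which(type=='A')] <- sample(gl(10, 5))
  group[which(type=='B')] <- sample(gl(10, 3))
  group[which(type=='C')] <- sample(gl(10, 2))
})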


foo <- split(d[, 1:2], d$group) 
# above, adjust 1:2 to reflect which columns you want 
#  to include in the split data.frames.

foo[1:2] # First 2 (of 10) elements

$`1`
     id type
20   20    A
31   31    C
34   34    C
37   37    A
42   42    A
52   52    B
60   60    A
74   74    B
77   77    A
100 100    B

$`2`
   id type
1   1    C
17 17    C
27 27    A
46 46    B
57 57    B
58 58    A
62 62    B
71 71    A
72 72    A
89 89    A

Each element of the list foo has 5 x A, 3 x B, and 2 x C. This is achieved by identifying the indices corresponding to each type in turn (using which), then assigning permuted group numbers 1 through 10 (with the number of repetitions corresponding to your desired distribution). Finally, split is used to split the data.frame into a list of data.frames.
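As a quick sanity check of the claim above, the following sketch rebuilds the toy data and cross-tabulates group against type. The seed and the name counts are assumptions added here for illustration; they are not part of the original answer.

```r
# Rebuild the answer's toy example so that this check is self-contained.
set.seed(1)  # assumption: arbitrary seed, only to make the run reproducible
d <- data.frame(id = 1:100,
                type = sample(rep(c("A", "B", "C"), c(50, 30, 20))),
                group = NA)
d <- within(d, {
  group[which(type == "A")] <- sample(gl(10, 5))
  group[which(type == "B")] <- sample(gl(10, 3))
  group[which(type == "C")] <- sample(gl(10, 2))
})
foo <- split(d[, 1:2], d$group)

# Cross-tabulate group against type: one row per group, one column per type.
counts <- table(d$group, d$type)
print(counts)

# Every group must hold exactly 5 A, 3 B and 2 C ...
stopifnot(all(counts[, "A"] == 5),
          all(counts[, "B"] == 3),
          all(counts[, "C"] == 2))
# ... and no id may appear in more than one subset.
stopifnot(!anyDuplicated(unlist(lapply(foo, `[[`, "id"))))
```

Because the group numbers are permuted within each type, the subsets are disjoint by construction; the two stopifnot calls only confirm this.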

To generalise this solution to a dataset with 10,000 rows, with 100 rows in each subset, simply adjust the arguments to gl, e.g. group[which(type=='A')] <- sample(gl(100, 50)) (if there are 5,000 A rows in the large dataset).
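Rather than hard-coding one line per type, the same idea could be wrapped in a small helper. This is a minimal sketch, not part of the original answer: the function name split_stratified and its arguments are hypothetical, and it assumes that the number of rows of each type is an exact multiple of the number of subsets.

```r
# Hypothetical helper (not from the answer): split `data` into `k` disjoint
# subsets that each preserve the class counts of the column named `by`.
split_stratified <- function(data, by, k) {
  group <- integer(nrow(data))
  for (lev in unique(as.character(data[[by]]))) {
    idx <- which(data[[by]] == lev)
    # Assumption: each class count divides evenly into k subsets.
    stopifnot(length(idx) %% k == 0)
    # Each group number occurs length(idx) / k times, in random order.
    group[idx] <- sample(rep(seq_len(k), each = length(idx) / k))
  }
  split(data, group)
}

# Toy 10,000-row data set with the 0.5 / 0.3 / 0.2 mix from the question.
big <- data.frame(id = 1:10000,
                  type = sample(rep(c("A", "B", "C"), c(5000, 3000, 2000))))
subsets <- split_stratified(big, "type", 100)

length(subsets)            # number of subsets
table(subsets[[1]]$type)   # class counts in the first subset
```

With these inputs every one of the 100 subsets holds 50 A, 30 B and 20 C rows. If the class counts do not divide evenly, the stopifnot inside the helper fails, and you would need to decide how to distribute the leftover rows.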

Answer 1 (accepted, 2014-02-24 10:25:58)
