Sample a data set of 10,000 rows into unique sets of 100 based on probability of a particular column value
I have a dataset of 20 rows with 4 columns (ID, Name, Age, Type). [simplified data set]
Original data set:
>data
ID Name Age Type
1 ABC 23 A
2 CDE 34 A
3 ABCE 23 C
4 CDEYU 34 B
5 ABCW 23 A
6 CDEDR 34 B
7 ASER 23 A
8 CDEAW 34 B
9 ABCHKJ 23 A
10 CDEFDE 34 C
11 ABCDDD 23 A
12 CDEDDD 34 A
13 ABCEDDD 23 C
14 CDEYUDDD 34 B
15 ABCWDDD 23 A
16 CDEDRDDD 34 B
17 ASERDDD 23 A
18 CDEAWDDD 34 B
19 ABCHKJDDD 23 A
20 CDEFDEDDD 34 C
Here the "Type" column is distributed in such a way that the probabilities of A, B, C are (0.5, 0.3, 0.2) respectively.
Now, I want to cut two unique sets of 10 each, so that each set will have 10 rows with the same probability distribution.
Can I use the sample function to achieve this purpose?
Something like this:
sample(data, 10, replace=F, prob((data$Type="A")=0.5,(data$Type="B")=0.3,(data$Type="C")=0.2))
Also, how do I write a loop to get this continuously for a big set of 100 rows? I mean 10 sets from a dataset of 100 rows.
Expected Output:
Dataset 1:
ID Name Age Type
1 ABC 23 A
2 CDE 34 A
3 ABCE 23 C
4 CDEYU 34 B
5 ABCW 23 A
6 CDEDR 34 B
7 ASER 23 A
8 CDEAW 34 B
9 ABCHKJ 23 A
10 CDEFDE 34 C
Dataset 2:
ID Name Age Type
1 ABCDDD 23 A
2 CDEDDD 34 A
3 ABCEDDD 23 C
4 CDEYUDDD 34 B
5 ABCWDDD 23 A
6 CDEDRDDD 34 B
7 ASERDDD 23 A
8 CDEAWDDD 34 B
9 ABCHKJDDD 23 A
10 CDEFDEDDD 34 C
Any help in this regard would be greatly appreciated.
Here is one way to achieve what I believe you intend to do:
d <- data.frame(id=1:100,
                type=sample(unlist(mapply(rep, c('A', 'B', 'C'),
                                          c(50, 30, 20), USE.NAMES=F))),
                group=NA)
d <- within(d, {
  group[which(type=='A')] <- sample(gl(10, 5))
  group[which(type=='B')] <- sample(gl(10, 3))
  group[which(type=='C')] <- sample(gl(10, 2))
})
foo <- split(d[, 1:2], d$group)
# above, adjust 1:2 to reflect which columns you want
# to include in the split data.frames.
foo[1:2] # First 2 (of 10) elements
$`1`
id type
20 20 A
31 31 C
34 34 C
37 37 A
42 42 A
52 52 B
60 60 A
74 74 B
77 77 A
100 100 B
$`2`
id type
1 1 C
17 17 C
27 27 A
46 46 B
57 57 B
58 58 A
62 62 B
71 71 A
72 72 A
89 89 A
Each element of list foo has 5 x A, 3 x B, and 2 x C.
This is achieved by identifying the indices corresponding to each type in turn (using which), then assigning permuted group numbers 1 through 10 (with the number of repetitions corresponding to your desired distribution). Finally, split is used to split the data.frame to a list of data.frames.
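One way to check the claim about the group composition is to cross-tabulate group against type. The sketch below rebuilds the 100-row example on its own so that it can be run in isolation. It uses rep() in place of gl() (an assumption made here so that group is a plain integer column), and the seed is arbitrary and only there for reproducibility.

```r
set.seed(1)  # arbitrary seed, assumed here only for reproducibility

# Rebuild the example data: 100 rows with 50 A, 30 B, 20 C in random order.
d <- data.frame(id = 1:100,
                type = sample(rep(c("A", "B", "C"), times = c(50, 30, 20))),
                group = NA)

# Within each type, hand out shuffled group labels 1..10.
# rep(1:10, each = k) plays the role of gl(10, k) in the answer above.
d$group[d$type == "A"] <- sample(rep(1:10, each = 5))
d$group[d$type == "B"] <- sample(rep(1:10, each = 3))
d$group[d$type == "C"] <- sample(rep(1:10, each = 2))

# One row per group, one column per type.
counts <- table(d$group, d$type)
print(counts)
```

If the assignment worked, every row of the printed table should show the same three counts for A, B and C.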
To generalise this solution to a dataset with 10,000 rows, with 100 rows in each subset, simply adjust the arguments to gl, e.g. group[which(type=='A')] <- sample(gl(100, 50)) (if there are 5000 A in the large dataset).
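Rather than writing one line per type, the same idea can be wrapped in a small function. This is only a sketch: stratified_split is a hypothetical helper name that does not appear in the answer, the 10,000-row data frame big is simulated here, and the function assumes that the count of every type is an exact multiple of the number of groups.

```r
# Hypothetical helper (not part of the original answer): splits `df` into
# `n_groups` data.frames so that each one receives the same number of rows
# of every value in column `col`.
# Assumption: each value's row count is divisible by n_groups.
stratified_split <- function(df, col, n_groups) {
  group <- integer(nrow(df))
  for (lvl in as.character(unique(df[[col]]))) {
    idx <- which(df[[col]] == lvl)
    stopifnot(length(idx) %% n_groups == 0)
    # Shuffled labels 1..n_groups, each repeated equally often.
    group[idx] <- sample(rep(seq_len(n_groups),
                             each = length(idx) / n_groups))
  }
  split(df, group)
}

set.seed(42)  # arbitrary seed, assumed here only for reproducibility

# Simulated large dataset: 5000 A, 3000 B, 2000 C in random order.
big <- data.frame(id = 1:10000,
                  Type = sample(rep(c("A", "B", "C"),
                                    times = c(5000, 3000, 2000))))

sets <- stratified_split(big, "Type", 100)
length(sets)           # number of subsets
table(sets[[1]]$Type)  # composition of the first subset
```

Because every row receives exactly one group label, the subsets do not overlap, which matches the "unique sets" requirement in the question.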