
R Sample By Minimum Cell Size

set.seed(1)
data <- data.frame(SCHOOL = rep(1:10, each = 1000),
                   GRADE = sample(7:12, size = 10000, replace = TRUE),
                   SCORE = sample(1:100, size = 10000, replace = TRUE))

I have 'data' that contains information about student test scores. I wish to count how many rows each GRADE has within each SCHOOL, and then, for each GRADE, take the smallest of those counts across all SCHOOLs. Like this:

1. For each SCHOOL, count the number of rows for each GRADE.
2. For each GRADE, find the smallest of those counts across all SCHOOLs.
3. Finally, take a random sample from each SCHOOL and GRADE whose size is the smallest count found in step 2.
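The first two steps can be sketched in base R as a quick way to see the target sample sizes before doing any sampling. This is only an illustration, not part of the original post: the names `cell_counts` and `min_per_grade` are made up here, and the data are rebuilt so the snippet runs on its own.

```r
# Rebuild the example data from the question
set.seed(1)
data <- data.frame(SCHOOL = rep(1:10, each = 1000),
                   GRADE = sample(7:12, size = 10000, replace = TRUE),
                   SCORE = sample(1:100, size = 10000, replace = TRUE))

# Step 1: rows per SCHOOL (table rows) and GRADE (table columns)
# cell_counts is a hypothetical name used only in this sketch
cell_counts <- table(SCHOOL = data$SCHOOL, GRADE = data$GRADE)

# Step 2: smallest cell size within each GRADE, taken across SCHOOLs
min_per_grade <- apply(cell_counts, 2, min)
print(min_per_grade)
```

Each entry of `min_per_grade` is the number of rows that step 3 would draw from every SCHOOL for that GRADE.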

So basically, in this basic example with two SCHOOLs and GRADE 7 and GRADE 8:

SCHOOL 1 has 2 SCOREs for GRADE 7 and 3 SCOREs for GRADE 8.

SCHOOL 2 has 1 SCORE for GRADE 7 and 4 SCOREs for GRADE 8.

So the new data contains one SCORE for GRADE 7 from each of SCHOOL 1 and SCHOOL 2, and three SCOREs for GRADE 8 from each of SCHOOL 1 and SCHOOL 2. The SCOREs that are picked are randomly sampled.
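The two-SCHOOL example can be reproduced roughly as follows. The SCORE values are invented for illustration (the original picture is not available); only the cell sizes 2, 3, 1 and 4 come from the description. The names `toy`, `toy_sample` and `cell_sizes` are likewise hypothetical.

```r
library(data.table)

# Hypothetical SCORE values; only the cell sizes match the example
toy <- data.table(
  SCHOOL = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  GRADE  = c(7, 7, 8, 8, 8, 7, 8, 8, 8, 8),
  SCORE  = c(55, 60, 71, 64, 80, 90, 42, 77, 68, 59)
)

toy[, N := .N, by = .(SCHOOL, GRADE)]  # rows in each SCHOOL/GRADE cell
toy[, N := min(N), by = GRADE]         # smallest cell size per GRADE

# Draw N row positions within each cell; sample(.N, N) is safe
# even when a cell holds a single row
set.seed(1)
toy_sample <- toy[, .SD[sample(.N, N)], by = .(SCHOOL, GRADE, N)]
toy_sample[, N := NULL]

# Cell sizes after sampling
cell_sizes <- toy_sample[, .N, by = .(SCHOOL, GRADE)]
print(cell_sizes)
```

Which SCOREs end up in `toy_sample` depends on the seed, but the cell sizes should be one row per SCHOOL for GRADE 7 and three rows per SCHOOL for GRADE 8.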


My attempt: data[, .SD[sample(x = .N, size = min(sum(GRADE), .N))], by = .(SCHOOL, GRADE)]

This follows your description of how to do it step-by-step.

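library(data.table)
setDT(data)                        # convert the data.frame to a data.table by reference
data[, N := .N, .(SCHOOL, GRADE)]  # step 1: rows per SCHOOL/GRADE cell
data[, N := min(N), GRADE]         # step 2: smallest cell size per GRADE
# step 3: draw N SCOREs per cell; indexing with sample(.N, N) avoids
# sample() treating a single SCORE value x as the range 1:x
data[, .(SCORE = SCORE[sample(.N, N)]), .(SCHOOL, GRADE, N)][, -'N']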

If you have multiple SCORE-like columns and you want to keep the same rows from each, then you can use .SD like in your attempt:

data[, .SD[sample(.N, N)], .(SCHOOL, GRADE, N)][, -'N']
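One way to check the result, sketched here under the assumption that the sampled rows are stored in an object (called `balanced` below, a name chosen for this illustration), is to count the rows per cell again and confirm that each GRADE has a single cell size across SCHOOLs. The data are rebuilt so the snippet is self-contained.

```r
library(data.table)

set.seed(1)
data <- data.table(SCHOOL = rep(1:10, each = 1000),
                   GRADE = sample(7:12, size = 10000, replace = TRUE),
                   SCORE = sample(1:100, size = 10000, replace = TRUE))

data[, N := .N, by = .(SCHOOL, GRADE)]  # rows per SCHOOL/GRADE cell
data[, N := min(N), by = GRADE]         # smallest cell size per GRADE
balanced <- data[, .SD[sample(.N, N)], by = .(SCHOOL, GRADE, N)]

# Rows per cell in the sampled data (check and rows are hypothetical names)
check <- balanced[, .(rows = .N), by = .(SCHOOL, GRADE)]

# Each GRADE should show exactly one distinct cell size across SCHOOLs
sizes <- check[, .(n_sizes = uniqueN(rows)), by = GRADE]
print(sizes)
```

If `n_sizes` is 1 for every GRADE, every SCHOOL contributes the same number of rows for that GRADE.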
