
R - Sample random rows per group until reaching a maximum number of rows

I have a data set from which I want to take a random sample of up to 30 rows per group. However, I also want to make sure that at least 1 row for each level of another grouping is included. Additionally, some groups have fewer than 30 rows, in which case all of the rows for that group should be included. I can't include the exact data set I'm working with because it's proprietary; however, an example for a data frame df would be:

ID|Age|State|Gender|Salary

1 25 CO M 50000
2 34 CO M 72000
3 28 CO M 52000
4 25 CO F 44000
5 25 CA F 55000
6 34 CA F 100000
7 39 CA M 88000
8 34 CA M 59000
... up to 15000 rows

So, I want a random sample of the data set in which no more than 30 rows come from each state. Then, for each state, I want at least 1 row for each age and gender combination that exists in the data set. If there are fewer than 30 age/gender combinations for a given state, but there are more than 30 rows for that state, then the sample should include multiple rows for some age/gender combinations so that 30 rows are returned for that state. If there are fewer than 30 rows for that state, then I want all the rows in the data set for that state. If there are more than 30 age/gender combinations for a given state, then the sample should have 1 row per combination, up to 30 rows.

Is there a way for me to do this in R?

Here is some code that gets you halfway there. First, I simulated data that resembles yours.

df <-
  data.frame(
    ID = 1:1500,
    Age = sample(18:99, 1500, replace = TRUE),
    State = sample(state.abb, 1500, replace = TRUE),
    Gender = sample(c("M", "F"), 1500, replace = TRUE),
    Salary = sample(44:100 * 1000, 1500, replace = TRUE)
  )

Then group_by() creates the state grouping, and mutate() with n() records the number of rows per state. The sample itself can be drawn with slice_sample(), which supersedes sample_n() as of dplyr 1.0.0. With the default replace = FALSE it caps the requested size at the group size, so states with fewer than 30 rows simply return all of their rows.

library(dplyr)

df %>%
  group_by(State) %>%
  # record the group size; handy for any later quota logic
  mutate(n_state = n()) %>%
  # draw up to 30 rows per state; smaller states keep all of their rows
  slice_sample(n = 30) %>%
  ungroup()

This could be extended to compute the other group sizes you mention and to use that information to make sure you hit the quotas you are looking for. Unfortunately, I do not fully understand from your question what your quotas are.
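One possible reading of the quotas is: per state, keep one randomly chosen row for each Age/Gender combination first, then fill up with other random rows until 30 is reached. Below is a minimal sketch under that assumption. The function name sample_with_quota, its max_rows argument, and the helper columns rand_key and is_first are made up for illustration and are not part of dplyr.

```r
library(dplyr)

# Sketch only: assumes columns State, Age and Gender exist in `data`.
sample_with_quota <- function(data, max_rows = 30) {
  data %>%
    # one random key per row; all random choices below are based on it
    mutate(rand_key = runif(n())) %>%
    # flag one random row per State/Age/Gender combination
    group_by(State, Age, Gender) %>%
    mutate(is_first = rand_key == min(rand_key)) %>%
    # within each state, put the flagged rows first, the rest in random order
    group_by(State) %>%
    arrange(desc(is_first), rand_key, .by_group = TRUE) %>%
    # keep at most max_rows rows per state (all rows if the state is smaller)
    slice_head(n = max_rows) %>%
    ungroup() %>%
    select(-rand_key, -is_first)
}

# Usage on simulated data shaped like the example in the question
set.seed(42)
df <- data.frame(
  ID = 1:1500,
  Age = sample(18:99, 1500, replace = TRUE),
  State = sample(state.abb, 1500, replace = TRUE),
  Gender = sample(c("M", "F"), 1500, replace = TRUE),
  Salary = sample(44:100 * 1000, 1500, replace = TRUE)
)
sampled <- sample_with_quota(df, max_rows = 30)
```

If a state has more than 30 Age/Gender combinations, the first 30 rows after sorting are all flagged rows, so the sample holds 30 distinct combinations chosen at random. If your quotas mean something else, the grouping inside the first group_by() is the part to change.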
