R - Sample random rows per group, up to a maximum number of rows

Question

I have a data set from which I want to take a random sample of up to 30 rows per group. However, I also want to make sure that at least 1 row for another grouping is included. Additionally, some groups have fewer than 30 rows, in which case all of the rows for that group should be included. I can't include the exact data set I'm working with because it's proprietary; however, an example for a data frame df would be:

ID  Age  State  Gender  Salary
1   25   CO     M       50000
2   34   CO     M       72000
3   28   CO     M       52000
4   25   CO     F       44000
5   25   CA     F       55000
6   34   CA     F       100000
7   39   CA     M       88000
8   34   CA     M       59000
... up to 15000 rows

So, I want a random sample of the data set with no more than 30 rows from each state. Then, for each state, I want at least 1 row for each age and gender combination that exists in the data set. If there are fewer than 30 age/gender combinations for a given state, but there are more than 30 rows for that state, then the sample should include multiple rows for a given age/gender so that 30 rows are given for that state. If there are fewer than 30 rows for that state, then I want all the rows in the data set for that state. If there are more than 30 age/gender combinations for a given state, then the sample should have 1 of each, up to 30.

Is there a way for me to do this in R?

Answer 1

Here is some code that gets you halfway. First I simulated data that resembles yours.

df <-
  data.frame(
    ID = 1:1500,
    Age = sample(18:99, 1500, replace = TRUE),
    State = sample(state.abb, 1500, replace = TRUE),
    Gender = sample(c("M", "F"), 1500, replace = TRUE),
    Salary = sample(44:100 * 1000, 1500, replace = TRUE)
  )

Then with group_by() you can create the state grouping and determine the number of rows per state with mutate() and n(). That information can then be used to draw samples with sample_n() that adjust to the group size.

library(dplyr)

df %>%
  group_by(State) %>%
  # number of rows in each state
  mutate(n_state = n()) %>%
  # size must be a single number: 30, or the state's row count if smaller
  sample_n(min(n_state, 30))
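
If dplyr 1.0.0 or later is available, slice_sample() may be a shorter way to get the same per-state cap. With the default replace = FALSE, a request for more rows than a group holds is truncated to the group size, so no explicit size check is needed. A minimal sketch, with the simulated data re-created (and a seed added, which is my addition) so the snippet runs on its own:

```r
library(dplyr)

set.seed(42)  # seed chosen arbitrarily, only for reproducibility

# Simulated data in the same shape as above
df <- data.frame(
  ID = 1:1500,
  Age = sample(18:99, 1500, replace = TRUE),
  State = sample(state.abb, 1500, replace = TRUE),
  Gender = sample(c("M", "F"), 1500, replace = TRUE),
  Salary = sample(44:100 * 1000, 1500, replace = TRUE)
)

sampled <- df %>%
  group_by(State) %>%
  slice_sample(n = 30) %>%  # states with fewer than 30 rows keep all rows
  ungroup()

# Rows kept per state
count(sampled, State)
```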

This could be extended to calculate the further group sizes you mention, and that information could then be used to ensure you hit the quotas you are looking for. Unfortunately I do not fully understand from your question what your quotas are.
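
One possible reading of the quotas is: per state, keep at most 30 rows, give priority to one random row from each Age/Gender combination, and fill any remaining slots with other random rows from that state. Under that assumption, the sketch below might be a starting point. The function name sample_with_coverage and the helper columns rand_key and is_cover are hypothetical names introduced here, not part of dplyr or of the question.

```r
library(dplyr)

# Hypothetical helper: per-state sample of at most `max_rows` rows that
# prefers one random row from each Age/Gender combination.
# Assumes `data` has columns State, Age and Gender, and no columns
# already named rand_key or is_cover.
sample_with_coverage <- function(data, max_rows = 30) {
  data %>%
    # random sort key for every row
    mutate(rand_key = runif(n())) %>%
    group_by(State, Age, Gender) %>%
    # flag one random row per State/Age/Gender combination
    mutate(is_cover = rand_key == min(rand_key)) %>%
    group_by(State) %>%
    # flagged rows first, then the remaining rows, both in random order
    arrange(desc(is_cover), rand_key, .by_group = TRUE) %>%
    # keep the first max_rows rows per state (all rows for smaller states)
    slice_head(n = max_rows) %>%
    ungroup() %>%
    select(-rand_key, -is_cover)
}

# Usage on simulated data shaped like the example above
set.seed(1)  # arbitrary seed, only for reproducibility
example_df <- data.frame(
  ID = 1:1500,
  Age = sample(18:99, 1500, replace = TRUE),
  State = sample(state.abb, 1500, replace = TRUE),
  Gender = sample(c("M", "F"), 1500, replace = TRUE),
  Salary = sample(44:100 * 1000, 1500, replace = TRUE)
)

result <- sample_with_coverage(example_df, max_rows = 30)
count(result, State)
```

Because the flagged rows are sorted ahead of the rest, a state with 30 or more combinations yields 30 rows from 30 different combinations, while a state with fewer combinations gets every combination once before the remaining slots are filled at random.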


Solution 1
0 2020-11-03 16:42:56
