Splitting a data set by group

Question

I have the following dataset:

library(dplyr)

df <- as.data.frame(c(rep("a", times = 9), rep("b", times = 18), rep("c", times = 27)))
colnames(df) <- "Location"
Year <- c(rep(1:3, times = 3), rep(1:6, times = 3), rep(1:9, times = 3))
df$Year <- Year

# Predictor is just a running row index; df is never grouped, so no ungroup() is needed
df <- df %>%
      mutate(Predictor = seq_along(Location))

print(df)

Location Year Predictor
        a    1         1
        a    2         2
        a    3         3
        a    1         4
        a    2         5
        a    3         6
        a    1         7
        a    2         8
        a    3         9
        b    1        10
        b    2        11
        b    3        12
        b    4        13
        b    5        14
... 40 more rows

I want to split the above data frame into training and test sets. For the test set, I want to randomly sample a third of the years in each Location, while keeping all rows for a given year together. So if year "1" is selected for location "a", I want all three rows with Year 1 in the test set, and so on. My test set should look something like this:

 Location Year Predictor
        a    1         1
        a    1         4
        a    1         7
        b    3        12
        b    3        18
        b    3        24
        b    5        14
        b    5        20
        b    5        26
        c    3        30
        c    3        39
        c    3        48
        c    6        33
        c    6        42
        c    6        51
        c    7        34
        c    7        43
        c    7        52

I found a similar question here, but that procedure would sample the same years and the same number of years from every location (and Year is numeric, not a factor). I want a different random sample of years from each location, with the number of sampled years proportional to the number of years in that location.

I would like to do this in dplyr if possible.

Answer 1

You can first create the distinct set of Location/Year combinations, then sample some of them within each Location, and finally use that sample in a semi_join against the original data. This could be done as:

df %>% 
  distinct(Location, Year) %>% 
  group_by(Location) %>% 
  sample_frac(.3) %>% 
  semi_join(df, .)

#    Location Year Predictor
# 1         a    3         3
# 2         a    3         6
# 3         a    3         9
# 4         b    4        13
# 5         b    4        19
# 6         b    4        25
# 7         b    5        14
# 8         b    5        20
# 9         b    5        26
# 10        c    8        35
# 11        c    8        44
# 12        c    8        53
# 13        c    1        28
# 14        c    1        37
# 15        c    1        46
# 16        c    2        29
# 17        c    2        38
# 18        c    2        47
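
The question asks for both a training and a test set, while the pipeline above returns only the test rows. A minimal sketch of one way to get both, assuming the sampled Location/Year pairs are stored in an intermediate table (the names test_years, test_set and train_set and the seed value are illustrative, not from the original answer): semi_join keeps the rows that match the sampled pairs, and anti_join keeps the rest.

```r
library(dplyr)

# Rebuild the example data from the question.
df <- data.frame(
  Location = c(rep("a", 9), rep("b", 18), rep("c", 27)),
  Year = c(rep(1:3, times = 3), rep(1:6, times = 3), rep(1:9, times = 3))
)
df$Predictor <- seq_len(nrow(df))

set.seed(42)  # arbitrary seed, only so the split is reproducible

# Sample roughly a third of the distinct years within each Location.
# (test_years is a hypothetical helper table, not part of the original answer.)
test_years <- df %>%
  distinct(Location, Year) %>%
  group_by(Location) %>%
  sample_frac(.3) %>%
  ungroup()

# Rows whose (Location, Year) pair was sampled form the test set ...
test_set <- semi_join(df, test_years, by = c("Location", "Year"))
# ... and every remaining row forms the training set.
train_set <- anti_join(df, test_years, by = c("Location", "Year"))
```

Because both joins use the same key table, no Location/Year pair can end up in both sets, and together the two sets contain every row of df.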

Answer 1 | 2 | Accepted | 2017-03-30 15:14:45
