dplyr unique rows sample_n

I am trying to draw random samples by group from a relatively large data frame. Every sampled row must be unique: no row may be repeated within an individual member's sample or anywhere in the overall result.

I have used this code successfully for small samples:

    processors2 <- processors %>%
      filter(str_detect(Person.Who.Changed.Object, "A0")) %>%
      group_by(User) %>%
      sample_n(2)

However, with the similar code below I get multiple duplicates, both within groups and overall (i.e. member 1 and member 3 get the same row of data, and member 1 gets two copies of another row).

    claimallocator2 <- claimallocator %>%
      group_by(User) %>%
      sample_n(80, weight = Claim.Amt)

Additionally, adding replace = FALSE makes no difference; I still get duplicates.

The expected output (obviously on a drastically smaller scale):

    User    Warranty.Claim  Claim.amt
    User 1  1   500
    User 1  2   1000
    User 1  3   1500
    User 1  4   2000
    User 1  5   2500
    User 2  6   3000
    User 2  7   3500
    User 2  8   4000
    User 2  9   4500
    User 2  10  5000
    User 2  11  5500
    User 2  12  6000
    User 3  13  6500
    User 3  14  7000
    User 3  15  7500
    User 3  16  8000
    User 3  17  8500
    User 3  18  9000
    User 3  19  9500
    User 3  20  10000
    User 3  21  10500
    User 3  22  11000

What I am actually getting:

    User    Warranty.Claim  Claim.amt
    User 1  1   500
    User 1  1   500
    User 1  3   1500
    User 1  4   2000
    User 1  5   2500
    User 2  6   3000
    User 2  7   3500
    User 2  8   4000
    User 2  9   4500
    User 2  10  5000
    User 2  11  5500
    User 2  12  6000
    User 3  13  6500
    User 3  14  7000
    User 3  15  7500
    User 3  16  8000
    User 3  17  8500
    User 3  18  9000
    User 3  19  9500
    User 3  8   4000
    User 3  21  10500
    User 3  22  11000

Try this approach: first remove the duplicated rows, then group by user and sample the desired number of cases.

    library(dplyr)

    # create toy data
    df <- data.frame(user     = sample(1:10, 1000, replace = TRUE),
                     warranty = sample(1:10, 1000, replace = TRUE),
                     claim    = sample(1:10, 1000, replace = TRUE))

    # count how often each user-warranty-claim trio occurs
    df %>% count(user, warranty, claim) %>% arrange(desc(n))

    # keep the first row of each trio, then sample 2 cases per user
    df %>% group_by(user, warranty, claim) %>% slice(1) %>%
      ungroup() %>% group_by(user) %>% sample_n(2)
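
The snippet above removes repeats of the same user-warranty-claim trio, so a claim listed under two different users could still be drawn for both. If the requirement is that no claim appears twice anywhere in the result, one possible variant is to deduplicate on the claim identifier alone before grouping. The sketch below is only an illustration: the `claims` tibble is made-up stand-in data, the column names are borrowed from the question, and `slice_sample()` (dplyr 1.0.0 or later) is swapped in for `sample_n()` to do the weighted draw.

```r
library(dplyr)

set.seed(42)  # reproducible draw

# Hypothetical stand-in data: claim 1 is recorded twice for User 1,
# and claim 8 is listed under both User 2 and User 3.
claims <- tibble(
  User           = c("User 1", "User 1", "User 1", "User 1",
                     "User 2", "User 2", "User 2",
                     "User 3", "User 3", "User 3"),
  Warranty.Claim = c(1, 1, 3, 4, 6, 7, 8, 8, 13, 14),
  Claim.Amt      = c(500, 500, 1500, 2000, 3000, 3500, 4000, 4000, 6500, 7000)
)

sampled <- claims %>%
  # keep only the first row seen for each claim, so a claim belongs to one user
  distinct(Warranty.Claim, .keep_all = TRUE) %>%
  group_by(User) %>%
  # weighted draw without replacement, 2 rows per user
  slice_sample(n = 2, weight_by = Claim.Amt) %>%
  ungroup()

# check: 0 means no claim appears more than once in the result
anyDuplicated(sampled$Warranty.Claim)
```

Because `distinct()` keeps the first row it encounters for each claim, the row order of the input decides which user retains a shared claim; sort or filter beforehand if a specific rule is needed. Note also that `slice_sample()` silently returns the whole group when it has fewer than `n` rows, whereas `sample_n()` without replacement raises an error in that case.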

You can check the replace option of the sample_n() function.
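
For reference, `replace = FALSE` is already the default of `sample_n()`, which would explain why setting it explicitly changes nothing. A minimal sketch of the difference, using a made-up `toy` tibble in which every row is unique (not the asker's data):

```r
library(dplyr)

set.seed(1)  # reproducible draw

# Hypothetical toy data with no repeated rows: 5 claims for each of 2 users
toy <- tibble(
  User           = rep(c("User 1", "User 2"), each = 5),
  Warranty.Claim = 1:10
)

# replace = FALSE (the default): each row is drawn at most once per group
no_repl <- toy %>% group_by(User) %>% sample_n(3, replace = FALSE)
anyDuplicated(no_repl$Warranty.Claim) == 0

# replace = TRUE: 8 draws from a 5-row group necessarily repeat some rows
with_repl <- toy %>% group_by(User) %>% sample_n(8, replace = TRUE)
anyDuplicated(with_repl$Warranty.Claim) > 0
```

If the input rows are unique and the draw is without replacement, the output cannot contain repeats, so duplicates in the result suggest that the source data frame itself holds repeated rows.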
