
R: subset data.frame based on column value using dplyr

library(dplyr)
mydat1 <- data.frame(ID = c(1, 1, 2, 2),
                    Gender = c("Male", "Female", "Male", "Male"),
                    Score = c(30, 40, 20, 60))
mydat1 %>%
  group_by(ID, Gender) %>%
  slice(which.min(Score))

# A tibble: 3 x 3
# Groups:   ID, Gender [3]
     ID Gender Score
  <dbl> <fctr> <dbl>
1     1 Female    40
2     1   Male    30
3     2   Male    20

I'm trying to group the rows by ID and Gender, and then keep only the row with the lowest Score in each group. The above code works as intended: for ID == 2, only the entry with the lower score is kept.

mydat2 <- data.frame(ID = c(1, 1, 2, 2),
                    Gender = c("Male", "Female", "Male", "Male"),
                    Score = c(NA, NA, 20, 60))

mydat2 %>%
  group_by(ID, Gender) %>%
  slice(which.min(Score))

# A tibble: 1 x 3
# Groups:   ID, Gender [1]
     ID Gender Score
  <dbl> <fctr> <dbl>
1     2   Male    20

However, when Score contains NAs, which.min doesn't behave the way I want: for a group whose scores are all NA it returns no valid index, so all of my ID == 1 entries are dropped. My desired output in this scenario is:

# A tibble: 3 x 3
# Groups:   ID, Gender [3]
     ID Gender Score
  <dbl> <fctr> <dbl>
1     1 Female    NA
2     1   Male    NA
3     2   Male    20

How can I modify my code to account for this?

Edit:

df2 <- structure(
  list(
    pubmed_id = c(23091106L, 23091106L),
    Gender = structure(c(4L, 4L),
                       .Label = c("", "Both", "female", "Female", "Male"),
                       class = "factor"),
    Total_Carrier = c(NA, 1107)
  ),
  class = c("grouped_df", "tbl_df", "tbl", "data.frame"),
  row.names = c(NA, -2L),
  vars = "pubmed_id",
  drop = TRUE,
  indices = list(0:1),
  group_sizes = 2L,
  biggest_group_size = 2L,
  labels = structure(
    list(pubmed_id = 23091106L),
    class = "data.frame",
    row.names = c(NA, -1L),
    vars = "pubmed_id",
    drop = TRUE,
    .Names = "pubmed_id"
  ),
  .Names = c("pubmed_id", "Gender", "Total_Carrier")
)

> df2
# A tibble: 2 x 3
# Groups:   pubmed_id [1]
  pubmed_id Gender Total_Carrier
      <int> <fctr>         <dbl>
1  23091106 Female            NA
2  23091106 Female          1107

In this example, I want the output to contain only row 2 (i.e. the row with a carrier sample size of 1107). However, I get the following result:

> df2 %>%
   group_by(pubmed_id, Gender) %>%
   slice(which.min(Total_Carrier) || 1)

# A tibble: 1 x 3
# Groups:   pubmed_id, Gender [1]
  pubmed_id Gender Total_Carrier
      <int> <fctr>         <dbl>
1  23091106 Female            NA

which.min ignores missing values and returns integer(0) when the input vector contains only NAs. You can add a check inside slice so that, when all Scores in a group are NA, the first row is picked:

mydat2 %>%
     group_by(ID, Gender) %>%
     slice({idx <- which.min(Score); if(length(idx) > 0) idx else 1})

# A tibble: 3 x 3
# Groups:   ID, Gender [3]
#     ID Gender Score
#  <dbl> <fctr> <dbl>
#1     1 Female    NA
#2     1   Male    NA
#3     2   Male    20
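
The behaviour behind this answer can be checked in base R without dplyr. The sketch below is illustrative only; `safe_which_min` is a hypothetical helper name, not part of any package. It also suggests why the `which.min(Total_Carrier) || 1` attempt from the question cannot work: `||` returns a logical value, not an index.

```r
# which.min() skips NA values and returns the position of the first minimum.
which.min(c(NA, 1107))    # 2
which.min(c(30, NA, 20))  # 3

# With only NAs there is no minimum, so the result has length zero.
length(which.min(c(NA_real_, NA_real_)))  # 0

# `||` coerces both sides to logical, so the index itself is lost.
which.min(c(NA, 1107)) || 1  # TRUE

# Hypothetical helper: fall back to position 1 when no minimum exists.
safe_which_min <- function(x) {
  idx <- which.min(x)
  if (length(idx) > 0) idx else 1L
}

safe_which_min(c(NA_real_, NA_real_))  # 1
safe_which_min(c(NA, 1107))            # 2
```

Inside the pipeline, such a helper could be called as slice(safe_which_min(Score)), assuming it is defined before the pipeline runs.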

You could also use arrange to sort the scores within each group, and then slice to select the first row of each group. Because arrange places NAs last, a group containing only NAs still yields its first row:

mydat2 %>%
  group_by(ID, Gender) %>%
  arrange(ID, Gender, Score) %>%
  slice(1)

     ID Gender Score
  <dbl> <fctr> <dbl>
1     1 Female    NA
2     1   Male    NA
3     2   Male    20
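
Applied to the data from the question's edit, the same idea might look like the sketch below. It rebuilds the two rows as a plain data.frame named `df2_plain` (an assumption made here, to avoid depending on the old grouped_df internals in the dput output) and uses `.by_group = TRUE` so that the sort happens within each group.

```r
library(dplyr)

# Plain-data.frame stand-in for df2 from the question (illustrative rebuild).
df2_plain <- data.frame(
  pubmed_id = c(23091106L, 23091106L),
  Gender = c("Female", "Female"),
  Total_Carrier = c(NA, 1107)
)

res <- df2_plain %>%
  group_by(pubmed_id, Gender) %>%
  arrange(Total_Carrier, .by_group = TRUE) %>%  # NA values are sorted last
  slice(1) %>%
  ungroup()

res$Total_Carrier  # 1107
```

Here the NA row is sorted behind the 1107 row, so slice(1) keeps the row the question asks for.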

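Here is another option with which and pmin. The index of the first minimum is capped at the group size n(), so an all-NA group (where which() finds nothing, the index is NA, and min() emits a warning) falls back to its last row:

mydat2 %>%
  group_by(ID, Gender) %>%
  slice(pmin(which(Score == min(Score, na.rm = TRUE))[1], n(), na.rm = TRUE))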
# A tibble: 3 x 3
# Groups:   ID, Gender [3]
#      ID Gender Score
#   <dbl> <fctr> <dbl>
#1     1 Female    NA
#2     1   Male    NA
#3     2   Male    20

A solution using data.table:

library(data.table)
setDT(mydat2)
mydat2[, .(Score = sort(Score)[1]), by = .(ID, Gender)]
#    ID Gender Score
# 1:  1   Male    NA
# 2:  1 Female    NA
# 3:  2   Male    20
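
The data.table answer relies on how sort() treats missing values. A small base-R sketch of that behaviour, using made-up vectors rather than the question's data:

```r
# sort() removes NA values by default (na.last = NA).
sort(c(60, NA, 20))     # 20 60
sort(c(60, NA, 20))[1]  # 20

# An all-NA vector sorts to length zero; indexing past the end gives NA.
sort(c(NA_real_, NA_real_))[1]  # NA
```

Since this approach aggregates Score rather than selecting a row, any columns other than the grouping columns and Score would not be carried along.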
