R: subset data.frame based on column value using dplyr
library(dplyr)
mydat1 <- data.frame(ID = c(1, 1, 2, 2),
Gender = c("Male", "Female", "Male", "Male"),
Score = c(30, 40, 20, 60))
mydat1 %>%
group_by(ID, Gender) %>%
slice(which.min(Score))
# A tibble: 3 x 3
# Groups: ID, Gender [3]
ID Gender Score
<dbl> <fctr> <dbl>
1 1 Female 40
2 1 Male 30
3 2 Male 20
I'm trying to group the rows by ID and Gender, and then keep only the row with the lowest Score in each group. The code above works as intended: for ID == 2, only the entry with the lower score is kept.
mydat2 <- data.frame(ID = c(1, 1, 2, 2),
Gender = c("Male", "Female", "Male", "Male"),
Score = c(NA, NA, 20, 60))
mydat2 %>%
group_by(ID, Gender) %>%
slice(which.min(Score))
# A tibble: 1 x 3
# Groups: ID, Gender [1]
ID Gender Score
<dbl> <fctr> <dbl>
1 2 Male 20
However, when there are NAs, which.min doesn't behave the way I want, because it doesn't return a valid index for a group whose scores are all NA. As a result, all of my ID == 1 entries are dropped. My desired output in this scenario is:
# A tibble: 3 x 3
# Groups: ID, Gender [3]
ID Gender Score
<dbl> <fctr> <dbl>
1 1 Female NA
2 1 Male NA
3 2 Male 20
How can I modify my code to account for this?
Edit:
df2 <- structure(list(pubmed_id = c(23091106L, 23091106L), Gender = structure(c(4L,
4L), .Label = c("", "Both", "female", "Female", "Male"), class = "factor"),
Total_Carrier = c(NA, 1107)), class = c("grouped_df", "tbl_df",
"tbl", "data.frame"), row.names = c(NA, -2L), vars = "pubmed_id", drop = TRUE, indices = list(
0:1), group_sizes = 2L, biggest_group_size = 2L, labels = structure(list(
pubmed_id = 23091106L), class = "data.frame", row.names = c(NA,
-1L), vars = "pubmed_id", drop = TRUE, .Names = "pubmed_id"), .Names = c("pubmed_id",
"Gender", "Total_Carrier"))
> df2
# A tibble: 2 x 3
# Groups: pubmed_id [1]
pubmed_id Gender Total_Carrier
<int> <fctr> <dbl>
1 23091106 Female NA
2 23091106 Female 1107
In this example, I want the output to contain only row 2 (i.e. the row with a carrier sample size of 1107). However, I get the following result:
> df2 %>%
group_by(pubmed_id, Gender) %>%
slice(which.min(Total_Carrier) || 1)
# A tibble: 1 x 3
# Groups: pubmed_id, Gender [1]
pubmed_id Gender Total_Carrier
<int> <fctr> <dbl>
1 23091106 Female NA
which.min ignores missing values and returns integer(0) when the input vector contains only NAs. You can add a condition check inside slice, i.e. when all Scores in a group are NA, pick the first row:
mydat2 %>%
group_by(ID, Gender) %>%
slice({idx <- which.min(Score); if(length(idx) > 0) idx else 1})
# A tibble: 3 x 3
# Groups: ID, Gender [3]
# ID Gender Score
# <dbl> <fctr> <dbl>
#1 1 Female NA
#2 1 Male NA
#3 2 Male 20
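As a rough check of the same idea against the data from the question's edit, the sketch below wraps the condition in a small function. The names first_min_or_one and df2_plain are hypothetical (they appear in neither the question nor dplyr), and the data frame is rebuilt by hand rather than from the dput output above.

```r
library(dplyr)

# which.min() skips NAs and returns integer(0) when nothing is left
which.min(c(NA, 1107))           # 2
which.min(c(NA_real_, NA_real_)) # integer(0)

# Hypothetical helper: index of the minimum, or 1 if every value is NA
first_min_or_one <- function(x) {
  idx <- which.min(x)
  if (length(idx) > 0) idx else 1L
}

# Hand-built copy of the two rows shown in the question's edit
df2_plain <- data.frame(pubmed_id = c(23091106L, 23091106L),
                        Gender = c("Female", "Female"),
                        Total_Carrier = c(NA, 1107))

res <- df2_plain %>%
  group_by(pubmed_id, Gender) %>%
  slice(first_min_or_one(Total_Carrier)) %>%
  ungroup()

res
```

With this data the single group keeps the row whose Total_Carrier is 1107, because which.min returns 2 and the fallback to row 1 is used only when a group has no non-missing value.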
You could also use arrange to sort the scores within each group, and then slice to select the first row of each group. arrange always puts NAs last, so a non-missing score is preferred when one exists, and if a group contains only NAs you still select its first row:
mydat2 %>%
group_by(ID, Gender) %>%
arrange(ID,Gender,Score) %>%
slice(1)
ID Gender Score
<dbl> <fctr> <dbl>
1 1 Female NA
2 1 Male NA
3 2 Male 20
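The sort-then-slice approach relies on arrange placing missing values at the end. A minimal sketch of that behaviour, using a made-up data frame d (not from the question) in which one group mixes NA with a real value and another is all NA:

```r
library(dplyr)

# Hypothetical data: group "a" has an NA listed before a real value,
# group "b" contains only NA
d <- data.frame(g = c("a", "a", "b"),
                x = c(NA, 5, NA))

res <- d %>%
  arrange(g, x) %>%   # NAs in x are sorted to the end within each g
  group_by(g) %>%
  slice(1) %>%        # first row per group after sorting
  ungroup()

res
```

Here group "a" keeps the row with x equal to 5 and group "b" keeps its NA row. The grouping columns are passed to arrange explicitly because arrange ignores grouping unless .by_group = TRUE is set.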
Here is another option with which and pmax:
mydat2 %>%
group_by(ID, Gender) %>%
# which(...)[1] is NA for an all-NA group (min() also warns there);
# pmax(..., na.rm = TRUE) then falls back to row 1
slice(pmax(1, which(Score == min(Score, na.rm = TRUE))[1], na.rm = TRUE))
# A tibble: 3 x 3
# Groups: ID, Gender [3]
# ID Gender Score
# <dbl> <fctr> <dbl>
#1 1 Female NA
#2 1 Male NA
#3 2 Male 20
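To see how this kind of index expression behaves when the minimum is not in the first row, here is a base R sketch on plain vectors. The helper pick_index is hypothetical and only mirrors the expression passed to slice, taking the combining function (pmin or pmax) as an argument so the two can be compared.

```r
# Hypothetical helper mirroring the slice() index expression
pick_index <- function(score, combine) {
  # min() warns and returns Inf when every value is NA, hence suppressWarnings
  m <- suppressWarnings(min(score, na.rm = TRUE))
  # which(...)[1] is the first position of the minimum, or NA if there is none
  combine(1, which(score == m)[1], na.rm = TRUE)
}

pick_index(c(NA, 1107), pmin)          # 1: the NA row
pick_index(c(NA, 1107), pmax)          # 2: the row holding the minimum
pick_index(c(NA_real_, NA_real_), pmax) # 1: fallback for an all-NA group
```

Since a valid row index is never below 1, pmin(1, idx) always evaluates to 1, while pmax(1, idx) returns idx and only falls back to 1 when idx is NA.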
A solution using data.table (sort drops NAs, so sort(Score)[1] is the smallest non-missing score, or NA when the group has none):
library(data.table)
setDT(mydat2)
mydat2[, .(Score = sort(Score)[1]), by = .(ID, Gender)]
# ID Gender Score
# 1: 1 Male NA
# 2: 1 Female NA
# 3: 2 Male 20