R: How do I choose which row dplyr::distinct() keeps based on a value in another variable?

The real-life problem: I have subjects with MRI scan data. Some of them have been scanned multiple times (separate rows), and some of those were scanned under a different protocol each time. I want to keep one row per subject ID, and if a subject was scanned under two different protocols, I want to prefer one protocol over the other.

The toy example:

library(dplyr)

df <- tibble(
  id = c("A", "A", "B", "C", "C", "D"),
  protocol = c("X", "Y", "X", "X", "X", "Y"),
  date = seq(as.Date("2018-01-01"), as.Date("2018-01-06"), by = "days"),
  var = 1:6
)

I want to return a data frame with one row per subject id. When an id is duplicated, instead of automatically keeping the first entry, I want to keep the entry with "Y" as the protocol if there is one, but without dropping subjects that only have "X" rows.

In the example, it would keep rows 2, 3, 4, and 6.

I prefer dplyr, but am open to other suggestions.

Nothing that I've tried even begins to work:

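df %>% distinct(id, .keep_all = TRUE)  # Nope! Keeps the first row per id (rows 1, 3, 4, 6)

df %>% distinct(id, protocol == "Y", .keep_all = TRUE)  # Nope! Keeps both rows for A

df$protocol <- factor(df$protocol, levels = c("Y", "X"))
df %>% distinct(id, .keep_all = TRUE)  # Nope! Factor level order does not change which row is kept

df %>% group_by(id) %>% filter(protocol == "Y")  # Nope! Drops B and C entirely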

Two good answers. @RobJensen suggests:

df %>% arrange(id, desc(protocol == 'Y')) %>% distinct(id, .keep_all = TRUE)  
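To see why this works, it may help to split the pipeline into its two steps. The sketch below rebuilds the toy data so it runs on its own; the intermediate names `sorted` and `result` are only illustrative.

```r
library(dplyr)

# Toy data from the question, rebuilt so the sketch is self-contained
df <- tibble(
  id = c("A", "A", "B", "C", "C", "D"),
  protocol = c("X", "Y", "X", "X", "X", "Y"),
  date = seq(as.Date("2018-01-01"), as.Date("2018-01-06"), by = "days"),
  var = 1:6
)

# Step 1: within each id, rows where protocol == "Y" is TRUE sort first
sorted <- df %>% arrange(id, desc(protocol == "Y"))

# Step 2: distinct() keeps the first row it encounters for each id
result <- sorted %>% distinct(id, .keep_all = TRUE)

result$var  # the original row numbers that survive
```

Because `var` holds the original row numbers in the toy data, it gives a quick check that rows 2, 3, 4, and 6 are the ones kept.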

If I have more than two protocols and want to set the order in which they are chosen, I can create a new variable that assigns each protocol an integer in order of preference, then use the suggestion from @joran:

df %>% group_by(id) %>% arrange(desc(protocol),var) %>% slice(1)  
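The integer-preference idea might look something like the sketch below. It assumes a third protocol "Z" and a preference order of Y, then Z, then X; the vector `protocol_order` and the column `protocol_preference` are illustrative names, not part of the original data.

```r
library(dplyr)

# Toy data with a third protocol, so the preference order matters
df2 <- tibble(
  id = c("A", "A", "B", "C", "C", "D"),
  protocol = c("X", "Y", "X", "X", "Z", "Y"),
  var = 1:6
)

# Assumed preference order: earlier entries are preferred
protocol_order <- c("Y", "Z", "X")

result <- df2 %>%
  # 1 = most preferred; protocols missing from protocol_order become NA,
  # and arrange() sorts NA last
  mutate(protocol_preference = match(protocol, protocol_order)) %>%
  group_by(id) %>%
  arrange(protocol_preference, var) %>%
  slice(1) %>%
  ungroup()
```

Sorting ascending on the integer avoids relying on the alphabetical order of the protocol names.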

Thanks!

Arranging alphabetically works in the simple case as stated. If you want, though, you can add a protocol_preference variable that orders what should be selected when Y isn't available, and that still selects "Y" even when it doesn't happen to be the last protocol value alphabetically.

Building off @davechilders' answer and @Nathan Werth's idea of creating a factor based on an "order of importance" vector:

# df2 is the three-protocol tibble defined in another answer on this page
order_of_importance <- c("Y", "Z", "X")

df2 %>%
  mutate(protocol = factor(protocol, levels = order_of_importance)) %>%
  arrange(id, protocol) %>%
  distinct(id, .keep_all = TRUE)

Or, if you just want to select 'Y' and have no preference for what is selected when 'Y' isn't available, you can do:

df %>% 
    arrange(id, desc(protocol == 'Y')) %>% 
    distinct(id, .keep_all = TRUE)

There's probably a faster way (almost certainly with data.table), but I think this would be the naive, direct approach in dplyr:

df %>% group_by(id) %>% arrange(desc(protocol),var) %>% do(head(.,1))

As @Gregor noted below (now deleted), slice(1) is probably a better idiom than do(head(., 1)).
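For the data.table route mentioned above, a minimal sketch might look like the following. This is an assumption about how one could translate the idea, not something from the original answers, and it makes no claim about actual speed.

```r
library(data.table)

# Toy data from the question as a data.table
dt <- data.table(
  id = c("A", "A", "B", "C", "C", "D"),
  protocol = c("X", "Y", "X", "X", "X", "Y"),
  date = seq(as.Date("2018-01-01"), as.Date("2018-01-06"), by = "days"),
  var = 1:6
)

# FALSE sorts before TRUE, so rows where protocol != "Y" is FALSE
# (the "Y" rows) come first within each id
sorted <- dt[order(id, protocol != "Y")]

# unique() with by = "id" keeps the first row per id
result <- unique(sorted, by = "id")
```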

If you want the dplyr output to be a plain tibble rather than a grouped_df, you can get the same result without group_by():

df %>% arrange(id, desc(protocol)) %>% distinct(id, .keep_all = TRUE)
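The difference between the two forms can be checked with `is_grouped_df()`. The sketch below rebuilds the toy data; `grouped` and `ungrouped` are illustrative names.

```r
library(dplyr)

df <- tibble(
  id = c("A", "A", "B", "C", "C", "D"),
  protocol = c("X", "Y", "X", "X", "X", "Y"),
  date = seq(as.Date("2018-01-01"), as.Date("2018-01-06"), by = "days"),
  var = 1:6
)

# group_by() + slice(1) leaves the grouping attached to the result
grouped <- df %>%
  group_by(id) %>%
  arrange(desc(protocol), var) %>%
  slice(1)

# arrange() + distinct() never groups, so the result is a plain tibble
ungrouped <- df %>%
  arrange(id, desc(protocol)) %>%
  distinct(id, .keep_all = TRUE)

is_grouped_df(grouped)           # TRUE
is_grouped_df(ungrouped)         # FALSE
is_grouped_df(ungroup(grouped))  # FALSE
```

Calling `ungroup()` at the end of the grouped pipeline is another way to drop the grouping if the group_by() form is preferred.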

You could break the process into steps: grab the must-haves, grab whatever is available for the remaining IDs, and combine the two.

distinct_y <- df %>%
  filter(protocol == "Y") %>%
  distinct(id, .keep_all = TRUE)

distinct_other <- df %>%
  anti_join(distinct_y, "id") %>%
  distinct(id, .keep_all = TRUE)

distinct_combined <- rbind(distinct_y, distinct_other)

If you'd like to generalize it from "one above all" to an ordering of values, I suggest making protocol a factor.

For example, suppose there are three protocols: X, Y, and Z. Y is the best, Z is better than X, and you only want X if there's nothing better.

# Only difference is the best protocol for C will now be Z.
df2 <- tibble(
  id = c("A", "A", "B", "C", "C", "D"),
  protocol = c("X", "Y", "X", "X", "Z", "Y"),
  date = c(seq(as.Date("2018-01-01"), as.Date("2018-01-06"),
               by="days")),
  var = 1:6
)

order_of_importance <- c("Y", "Z", "X")

df2 %>%
  mutate(protocol = factor(protocol, order_of_importance)) %>%
  group_by(id) %>%
  arrange(protocol) %>%
  slice(1)
# # A tibble: 4 x 4
# # Groups: id [4]
#   id    protocol date         var
#   <chr> <fctr>   <date>     <int>
# 1 A     Y        2018-01-02     2
# 2 B     X        2018-01-03     3
# 3 C     Z        2018-01-05     5
# 4 D     Y        2018-01-06     6

