
ifelse statement to assign values to a new column, working with lists of numeric values

I have a dataframe that looks something like this:

# Minimal example dataframe

identifier <- c(
  "A",
  "B",
  "C",
  "D",
  "E",
  "F"
)

value_1 <- c(
  "1231811, 1231877",
  "1231911, 1233069, 1232767",
  "1231919",
  NA,
  "1232135, 1233145",
  NA
)

value_2 <- c(
  1231811,
  190477,
  922661,
  950711,
  992647,
  NA
  
)

value_3 <- c(
  1231877,
  1233069,
  9774041,
  9774041,
  1314063,
  1231379
  
)

test_df <- data.frame(identifier, value_1, value_2, value_3)

  identifier                   value_1 value_2 value_3
1          A          1231811, 1231877 1231811 1231877
2          B 1231911, 1233069, 1232767  190477 1233069
3          C                   1231919  922661 9774041
4          D                      <NA>  950711 9774041
5          E          1232135, 1233145  992647 1314063
6          F                      <NA>    <NA> 1231379

I want to create a new column, final_value, and fill it with a single value from value_1, value_2, or value_3, following a hierarchy: prefer a value in value_1 that matches value_2, then one that matches value_3. If value_1 is not NA and none of its values match value_2 or value_3, fill final_value with the first value in the comma-separated value_1 string. If value_1 is NA, fill final_value with value_2 or, if that is also NA, with value_3. The final dataframe would look like this:

  identifier                   value_1 value_2 value_3 final_value
1          A          1231811, 1231877 1231811 1231877 1231811 # 1231811 from value_1 matches value_2 (preferred match)
2          B 1231911, 1233069, 1232767  190477 1233069 1233069 # no values from value_1 match value_2; however, 1233069 from value_1 matches value_3
3          C                   1231919  922661 9774041 1231919 # no values from value_1 match other columns; just fill with value_1
4          D                      <NA>  950711 9774041 950711  # value_1 is NA, so fill in with value_2
5          E          1232135, 1233145  992647 1314063 1232135 # no values from value_1 match other columns, fill with first item from value_1 list
6          F                      <NA>    <NA> 1231379 1231379 # value_1 and value_2 are NA, so fill in with value_3

Here's my approach so far...

library(purrr)
library(dplyr)
library(stringr) # provides str_split(), used below

# change value_1 column into a list of numeric values 
test_df <- test_df%>% mutate(value_1 = map(value_1,function(x) (as.numeric(unlist(str_split(x,","))))))

# create a new column to hold the final selected value
test_df$final_value <- NA

# ifelse statement
test_df$final_value <- 
  
  # if any of the elements in value_1 match the value_2 value, fill the new column with value_2
  ifelse(!is.na(test_df$value_1) & test_df$value_1 %in% test_df$value_2, test_df$value_2,
         
         # otherwise, if a value in value_1 matches value_3, fill in with value_3
         ifelse(!is.na(test_df$value_1) & test_df$value_1 %in% test_df$value_3, test_df$value_3,
                
                # if none of the values in value_1 match the other columns, fill in with the first value_1 list value
                ifelse(!is.na(test_df$value_1) & !(test_df$value_1 %in% test_df$value_2) & !(test_df$value_1 %in% test_df$value_3), test_df$value_1, #NOTE: have tried test_df$value_1[1] and test_df$value_1[[1]] without success to get the first list item returned
                       
                       # if value_1 is NA, fill in with value_2
                       ifelse(is.na(test_df$value_1) & !is.na(test_df$value_2), test_df$value_2,
                              
                              # if value_1 is NA and value_2 is NA, fill in with value_3
                              ifelse(is.na(test_df$value_1) & is.na(test_df$value_2) & !is.na(test_df$value_3), test_df$value_3, NA
         
         
  )))))

There are a few problems with the result:

  identifier                   value_1 value_2 value_3               final_value
1          A          1231811, 1231877 1231811 1314063          1231811, 1231877
2          B 1231911, 1233069, 1232767  190477 1233069 1231911, 1233069, 1232767
3          C                   1231919  922661 9774041                   1231919
4          D                        NA  950711 9774041                    950711
5          E          1232135, 1233145  992647 1314063          1232135, 1233145
6          F                        NA      NA 1231379                   1231379

The first three branches of the ifelse are not working as anticipated. It fails to return the matching value_2 or value_3 value in final_value, and I also cannot get it to return the first list item from value_1 when no value_2 or value_3 values match. For the latter, I've tried specifying test_df$value_1[[1]][1] (and similar), but this only returns the first item in the value_1 list for identifier A:

  identifier                   value_1 value_2 value_3 final_value
1          A          1231811, 1231877 1231811 1314063     1231811
2          B 1231911, 1233069, 1232767  190477 1233069     1231811
3          C                   1231919  922661 9774041     1231811
4          D                        NA  950711 9774041      950711
5          E          1232135, 1233145  992647 1314063     1231811
6          F                        NA      NA 1231379     1231379

Any help would be greatly appreciated.

First, nesting ifelse more than two deep generally leads me to suggest case_when. In this case, however, I think there is a much better solution without it:

func <- function(A, ...) {
  if (length(A) == 1L && is.na(A)) {
    if (length(list(...))) na.omit(unlist(list(...)))[1] else NA
  } else {
    L <- lapply(list(...), intersect, x = A)
    L <- c(L[lengths(L) > 0], A)
    L[[1]][1]
  }
}

library(dplyr)
test_df %>%
  mutate(
    final_value = mapply(func, strsplit(value_1, "[, ]+"), value_2, value_3)
  )
#   identifier                   value_1 value_2 value_3 final_value
# 1          A          1231811, 1231877 1231811 1231877     1231811
# 2          B 1231911, 1233069, 1232767  190477 1233069     1233069
# 3          C                   1231919  922661 9774041     1231919
# 4          D                      <NA>  950711 9774041      950711
# 5          E          1232135, 1233145  992647 1314063     1232135
# 6          F                      <NA>      NA 1231379     1231379

Because I use ... in func, this handles "0 or more" other value_* variables, as you want; if you have 3 or 30 more, it will apply the same logic. Further, the order within ... matters: those listed earlier are prioritized higher for matches.
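To illustrate the point about ... and its ordering, here is a small sketch that calls func directly on the pieces of row B. The extra column (a hypothetical value_4 holding 1232767) is an assumption made up for this example and is not in the original data; func is repeated from the answer so the snippet runs on its own.

```r
# func as defined in the answer, repeated so this snippet is self-contained
func <- function(A, ...) {
  if (length(A) == 1L && is.na(A)) {
    if (length(list(...))) na.omit(unlist(list(...)))[1] else NA
  } else {
    L <- lapply(list(...), intersect, x = A)
    L <- c(L[lengths(L) > 0], A)
    L[[1]][1]
  }
}

# row B's value_1, already split into its pieces
A <- c("1231911", "1233069", "1232767")

# value_2 (190477) matches nothing, so the value_3 match is returned
r1 <- func(A, 190477, 1233069)

# a hypothetical value_4 (1232767) listed before value_3 takes priority
r2 <- func(A, 190477, 1232767, 1233069)

# no other columns at all: falls back to the first piece of A
r3 <- func(A)

# value_1 is NA: the first non-NA among the remaining arguments is used
r4 <- func(NA_character_, NA, 1231379)
```

Depending on the R version, a matched value may come back as character or numeric, so it is safest to convert the result explicitly (for example with as.numeric()) before using it further.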

The c(L[lengths(L) > 0], A) ensures that we only consider the value_* that have non-empty intersections (the first portion) and, if all of those are empty, that we fall back to what is found in A. (In the unlikely event that A is NA and all value_* are also NA, then... you get NA.)
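The fallback logic can be traced step by step outside of func. The sketch below does this for rows B and E of the example data; the variable names (others, n_matches, picked_B, and so on) are illustrative only and not part of the answer.

```r
# Row B: value_1 split into pieces, plus value_2 and value_3
A <- c("1231911", "1233069", "1232767")
others <- list(190477, 1233069)

# one intersection per other column; empty when nothing matches
L <- lapply(others, intersect, x = A)
n_matches <- lengths(L)   # 0 for value_2, 1 for value_3

# drop the empty intersections, then append the pieces of A as the fallback
L <- c(L[lengths(L) > 0], A)
picked_B <- L[[1]][1]     # the value_3 match comes first

# Row E: nothing matches, so only the pieces of A remain in the list
A_E <- c("1232135", "1233145")
L_E <- lapply(list(992647, 1314063), intersect, x = A_E)
L_E <- c(L_E[lengths(L_E) > 0], A_E)
picked_E <- L_E[[1]][1]   # first piece of value_1
```

Note that c() applied to a list and a character vector appends each piece of A as its own list element, which is why L[[1]][1] yields the first piece of value_1 when no intersection survives.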

FYI, one inner step of this is to split your strings of comma-separated numbers into a list-column using strsplit. If you're going to do more operations like this that need to work on the individual components, you may prefer to keep it that way using mutate(value_1 = strsplit(value_1, "[, ]+")) (or similar).
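As a rough sketch of what keeping the list-column might look like, the following uses the base R equivalent of that mutate call on a cut-down copy of the example data (four rows, two columns, chosen only for brevity). The derived names n_values, first_value, and has_value_2 are made up for this illustration.

```r
# a cut-down copy of the example data
test_df <- data.frame(
  identifier = c("A", "B", "C", "D"),
  value_1 = c("1231811, 1231877", "1231911, 1233069, 1232767", "1231919", NA),
  value_2 = c(1231811, 190477, 922661, 950711),
  stringsAsFactors = FALSE
)

# base R equivalent of mutate(value_1 = strsplit(value_1, "[, ]+"))
test_df$value_1 <- strsplit(test_df$value_1, "[, ]+")

# number of pieces per row (an NA row still holds one element, NA)
n_values <- lengths(test_df$value_1)

# first piece of each row
first_value <- vapply(test_df$value_1, function(x) x[1], character(1))

# row-wise check: is value_2 among the pieces of value_1?
has_value_2 <- mapply(function(v1, v2) v2 %in% v1,
                      test_df$value_1, test_df$value_2)
```

The row-wise mapply is what the nested ifelse in the question was missing: %in% applied to the whole list-column compares entire list elements against the whole value_2 column, rather than comparing each row's pieces with that row's value.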
