Can someone explain the "unexpected '='" message in my semi_join function in R when I use relative references?

I'm trying to build a script in R that joins on different fields based on user input. I'm running dplyr 0.7.6 through tidyverse 1.2.1.

I could build multiple, mostly identical join statements and pick one based on the input, but that seems inelegant. Below is an example, with commentary underneath. I'm still fairly new to R, so I apologize if this is itself inelegant:

library(tidyverse)
df <- tibble(
  a = letters[1:20],
  b = c(1:5,1:5,1:5,1:5)
)

ref <- tibble(
  let_ref_col = c('e','g','b','d','f'),
  num_ref_col = c(2,4,NA,NA,NA)
)

df2 <- semi_join(df,ref,c('b'='num_ref_col'))

df3 <- semi_join(df,ref,c('b'=colnames(ref)[2]))
df2==df3 #just to check

df4 <- semi_join(df,ref,c(colnames(df)[2]=colnames(ref)[2]))

df2 returns 8 rows, where column b in df is 2 or 4.

R doesn't seem to mind me generalizing the second join variable name, as evidenced by df3.

When I try to apply the exact same logic to the first variable, I get an error message from df4:

Error: unexpected '=' in "df4 <- inner_join(df,ref,c(colnames(df)[2]="

I'd love to be able to have a relative reference for both fields if possible. Something like:

JOIN_DESIRED <- 2
df5 <- semi_join(df,ref,c(colnames(df)[JOIN_DESIRED] = colnames(ref)[JOIN_DESIRED]))

JOIN_DESIRED could then be changed to 1 to join by letters instead of numbers.

Here is a workaround. We can use names<- to assign the names.

df4 <- semi_join(df, ref, `names<-`(colnames(ref)[2], colnames(df)[2]))

identical(df2, df4)
# [1] TRUE

identical(df3, df4)
# [1] TRUE
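
The same idea extends to the JOIN_DESIRED index from the question. What follows is only a minimal sketch: it uses setNames() from base R's stats package, which here does the same job as names<-, and the helper name join_by_index is hypothetical (it is not part of dplyr). It also assumes the matching columns sit at the same position in both tables, as they do in the example data.

```r
library(dplyr)

df <- tibble(a = letters[1:20], b = rep(1:5, 4))
ref <- tibble(
  let_ref_col = c("e", "g", "b", "d", "f"),
  num_ref_col = c(2, 4, NA, NA, NA)
)

# Hypothetical helper: build the `by` vector from a column position.
join_by_index <- function(x, y, idx) {
  # The name is the column in x, the value is the column in y,
  # i.e. the same shape as c("b" = "num_ref_col").
  by <- setNames(colnames(y)[idx], colnames(x)[idx])
  semi_join(x, y, by = by)
}

JOIN_DESIRED <- 2
df5 <- join_by_index(df, ref, JOIN_DESIRED)  # joins b to num_ref_col
```

Setting JOIN_DESIRED to 1 would join a to let_ref_col instead. Because the names are attached by an ordinary function call, nothing has to appear on the left of = inside c(), which is what triggered the error.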

You're doing a lot of things on one line with your last line, semi_join(df,ref,c(colnames(df)[2]=colnames(ref)[2])). Specifically, in colnames(df)[2]=colnames(ref)[2] the = sits inside a call to c(), where R reads it as naming an argument. The parser only accepts a plain name or a quoted string on the left of that =, not an expression such as colnames(df)[2], so the line fails before anything is evaluated. Here's how I might program it:

library(tidyverse)

df <- tibble(
  a = letters[1:20],
  b = c(1:5,1:5,1:5,1:5)
)

ref <- tibble(
  let_ref_col = c('e','g','b','d','f'),
  num_ref_col = c(2,4,NA,NA,NA)
)

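# Semi-join df1 to df2 on the column found at position idx in both tables.
semi_join_by_column_index <- function(df1, df2, idx) {
  # Remember df1's column name so it can be restored afterwards
  original_name <- names(df1)[idx]

  # Give the join column the same temporary name in both tables
  names(df1)[idx] <- "join_column"
  names(df2)[idx] <- "join_column"

  new_df <- semi_join(df1, df2, by = "join_column")

  # Rename the temporary column back to its original name
  new_idx <- match("join_column", names(new_df))
  names(new_df)[new_idx] <- original_name

  return(new_df)
}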

merged_df <- semi_join_by_column_index(df, ref, idx = 2)
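
One way to check that the original error comes from R's parser rather than from dplyr is to parse the two forms of the by argument without evaluating them. This is a small base R sketch; the helper name parses is hypothetical, and df and ref do not need to exist because nothing is evaluated.

```r
# The two forms of the `by` argument from the question, as text.
bad  <- 'c(colnames(df)[2] = colnames(ref)[2])'
good <- 'c("b" = colnames(ref)[2])'

# Hypothetical helper: TRUE if the text is syntactically valid R.
parses <- function(code) {
  tryCatch({
    parse(text = code)  # parse only, never evaluate
    TRUE
  }, error = function(e) FALSE)
}

parses(bad)   # FALSE: an expression is not allowed left of '=' in a call
parses(good)  # TRUE: a string (or a bare name) is allowed
```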
