Why can't I use the double colon operator with dplyr when the dataset is in sparklyr?
A reproducible example (adapted from @forestfanjoe's answer):
library(dplyr)
library(sparklyr)
sc <- spark_connect(master = "local")
df <- data.frame(id = 1:100, PaymentHistory = runif(n = 100, min = -1, max = 2))
df <- copy_to(sc, df, "payment")
> head(df)
# Source: spark<?> [?? x 2]
     id PaymentHistory
* <int>          <dbl>
1     1         -0.138
2     2         -0.249
3     3         -0.805
4     4          1.30
5     5          1.54
6     6          0.936
fix_PaymentHistory <- function(df) {
  df %>% dplyr::mutate(
    PaymentHistory = dplyr::if_else(
      PaymentHistory < 0, 0,
      dplyr::if_else(PaymentHistory > 1, 1, PaymentHistory)
    )
  )
}
df %>% fix_PaymentHistory
The error is:
Error in dplyr::if_else(PaymentHistory < 0, 0, dplyr::if_else(PaymentHistory > :
object 'PaymentHistory' not found
I'm using the scope operator because I'm afraid that the name in dplyr will clash with some of the user-defined code. Note that PaymentHistory is a column variable in df.
The same error is not present when running the following code:
fix_PaymentHistory <- function(df){
df %>% mutate(PaymentHistory = if_else(PaymentHistory < 0, 0,if_else(PaymentHistory > 1,1, PaymentHistory)))
}
> df %>% fix_PaymentHistory
# Source: spark<?> [?? x 2]
      id PaymentHistory
 * <int>          <dbl>
 1     1         0
 2     2         0
 3     3         0
 4     4         1
 5     5         1
 6     6         0.936
 7     7         0
 8     8         0.716
 9     9         0
10    10         0.0831
# ... with more rows
TL;DR Because your code doesn't use dplyr::if_else at all.
sparklyr, when used as in the example, treats Spark as just another database and issues queries through the dbplyr SQL translation layer.
In this context, if_else is not treated as a function, but as an identifier which is converted to SQL primitives:
dbplyr::translate_sql(if_else(PaymentHistory < 0, 0,if_else(PaymentHistory > 1,1, PaymentHistory)))
# <SQL> CASE WHEN ("PaymentHistory" < 0.0) THEN (0.0) WHEN NOT("PaymentHistory" < 0.0) THEN (CASE WHEN ("PaymentHistory" > 1.0) THEN (1.0) WHEN NOT("PaymentHistory" > 1.0) THEN ("PaymentHistory") END) END
However, if you pass a fully qualified name, it circumvents this mechanism: dbplyr tries to evaluate the function locally, and the call ultimately fails because the database columns are not in scope.
I'm afraid that the name in dplyr will clash with some of the user-defined code.
As you can see, there is no need for dplyr to be in scope here at all: functions called in sparklyr pipelines are either translated to the corresponding SQL constructs, or, if no specific translation rule is in place, passed as-is and resolved by the Spark SQL engine (this path is used to invoke Spark functions).
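The pass-through path can be seen directly in the translation layer. As a sketch (percentile is a Spark SQL aggregate with no dplyr counterpart; note that newer dbplyr versions require an explicit con argument, which the version used in this answer defaulted):

```r
library(dbplyr)

# percentile() has no dbplyr translation rule, so the call is emitted
# as-is and left for the backend (here, Spark SQL) to resolve:
translate_sql(percentile(PaymentHistory, 0.5), con = simulate_dbi())
```

Compare this with if_else above, which does have a translation rule and is therefore rewritten into a CASE WHEN expression instead of being passed through.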
Of course this mechanism is not specific to sparklyr, and you're likely to see the same behavior with any other table backed by a database:
library(magrittr)
db <- dplyr::src_sqlite(":memory:", TRUE)
dplyr::copy_to(db, mtcars)
db %>% dplyr::tbl("mtcars") %>% dplyr::mutate(dplyr::if_else(mpg < 20, 1, 0))
Error in dplyr::if_else(mpg < 20, 1, 0) : object 'mpg' not found
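For comparison, dropping the namespace qualifier lets the translation layer rewrite the call into SQL, so the same pipeline succeeds. A sketch reusing the in-memory SQLite table from above (the low_mpg column name is just illustrative):

```r
library(magrittr)

db <- dplyr::src_sqlite(":memory:", TRUE)
dplyr::copy_to(db, mtcars)

# The unqualified if_else is never evaluated in R; it is translated to a
# CASE WHEN expression and executed inside SQLite:
db %>%
  dplyr::tbl("mtcars") %>%
  dplyr::mutate(low_mpg = if_else(mpg < 20, 1, 0))
```

Note that this works even though dplyr is not attached: inside mutate the expression is captured unevaluated, so the if_else symbol never needs to be bound in the R session at all.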