Why can't I use the double colon operator with dplyr when the dataset is in sparklyr?
A reproducible example (adapted from @forestfanjoe's answer):
library(dplyr)
library(sparklyr)
sc <- spark_connect(master = "local")
df <- data.frame(id = 1:100, PaymentHistory = runif(n = 100, min = -1, max = 2))
df <- copy_to(sc, df, "payment")
> head(df)
# Source: spark<?> [?? x 2]
     id PaymentHistory
* <int>          <dbl>
1     1         -0.138
2     2         -0.249
3     3         -0.805
4     4          1.30
5     5          1.54
6     6          0.936
fix_PaymentHistory <- function(df) {
  df %>% dplyr::mutate(
    PaymentHistory = dplyr::if_else(
      PaymentHistory < 0, 0,
      dplyr::if_else(PaymentHistory > 1, 1, PaymentHistory)
    )
  )
}
df %>% fix_PaymentHistory
The error is:
Error in dplyr::if_else(PaymentHistory < 0, 0, dplyr::if_else(PaymentHistory > :
object 'PaymentHistory' not found
I'm using the scope operator because I'm afraid that the name in dplyr will clash with some of the user-defined code. Note that PaymentHistory is a column variable in df.
The same error is not present when running the following code:
fix_PaymentHistory <- function(df){
df %>% mutate(PaymentHistory = if_else(PaymentHistory < 0, 0,if_else(PaymentHistory > 1,1, PaymentHistory)))
}
> df %>% fix_PaymentHistory
# Source: spark<?> [?? x 2]
      id PaymentHistory
 * <int>          <dbl>
 1     1         0
 2     2         0
 3     3         0
 4     4         1
 5     5         1
 6     6         0.936
 7     7         0
 8     8         0.716
 9     9         0
10    10         0.0831
# ... with more rows
TL;DR Because your code doesn't use dplyr::if_else at all.
sparklyr, when used as in the example, treats Spark as just another database and issues queries through the dbplyr SQL translation layer.
In this context, if_else is not treated as a function, but as an identifier which is converted to SQL primitives:
dbplyr::translate_sql(if_else(PaymentHistory < 0, 0,if_else(PaymentHistory > 1,1, PaymentHistory)))
# <SQL> CASE WHEN ("PaymentHistory" < 0.0) THEN (0.0) WHEN NOT("PaymentHistory" < 0.0) THEN (CASE WHEN ("PaymentHistory" > 1.0) THEN (1.0) WHEN NOT("PaymentHistory" > 1.0) THEN ("PaymentHistory") END) END
However, if you pass a fully qualified name, it circumvents this mechanism: dbplyr tries to evaluate the function locally, and the call ultimately fails because the database columns are not in scope.
I'm afraid that the name in dplyr will clash with some of the user-defined code.
As you can see, there is no need for dplyr to be in scope here at all: functions called in sparklyr pipelines are either translated to the corresponding SQL constructs, or, if no specific translation rule is in place, passed as-is and resolved by the Spark SQL engine (this path is used to invoke Spark functions).
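The pass-through path can be seen directly in the translation layer. As a sketch (percentile is a Spark SQL aggregate with no dplyr counterpart; note that newer dbplyr versions require an explicit con argument, which the version used in this answer defaulted):

```r
library(dbplyr)

# percentile() has no dbplyr translation rule, so the call is emitted
# as-is and left for the backend (here, Spark SQL) to resolve:
translate_sql(percentile(PaymentHistory, 0.5), con = simulate_dbi())
```

Compare this with if_else above, which does have a translation rule and is therefore rewritten into a CASE WHEN expression instead of being passed through.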
Of course this mechanism is not specific to sparklyr, and you're likely to see the same behavior with any other table backed by a database:
library(magrittr)
db <- dplyr::src_sqlite(":memory:", TRUE)
dplyr::copy_to(db, mtcars)
db %>% dplyr::tbl("mtcars") %>% dplyr::mutate(dplyr::if_else(mpg < 20, 1, 0))
Error in dplyr::if_else(mpg < 20, 1, 0) : object 'mpg' not found
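For comparison, dropping the namespace qualifier lets the translation layer rewrite the call into SQL, so the same pipeline succeeds. A sketch reusing the in-memory SQLite table from above (the low_mpg column name is just illustrative):

```r
library(magrittr)

db <- dplyr::src_sqlite(":memory:", TRUE)
dplyr::copy_to(db, mtcars)

# The unqualified if_else is never evaluated in R; it is translated to a
# CASE WHEN expression and executed inside SQLite:
db %>%
  dplyr::tbl("mtcars") %>%
  dplyr::mutate(low_mpg = if_else(mpg < 20, 1, 0))
```

Note that this works even though dplyr is not attached: inside mutate the expression is captured unevaluated, so the if_else symbol never needs to be bound in the R session at all.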