colnames in `sparklyr::spark_apply()` using `dplyr::mutate()`
Assuming `sc` is an existing spark(lyr) connection, the name given in `dplyr::mutate()` is ignored:
iris_tbl <- sdf_copy_to(sc, iris)
iris_tbl %>%
  spark_apply(function(e){
    library(dplyr)
    e %>% mutate(slm = median(Sepal_Length))
  })
## Source: table<sparklyr_tmp_60a41ac01b4e> [?? x 6]
## Database: spark_connection
# Sepal_Length Sepal_Width Petal_Length Petal_Width Species X6
# <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
# 1 5.1 3.5 1.4 0.2 setosa 5.8
# 2 4.9 3.0 1.4 0.2 setosa 5.8
# 3 4.7 3.2 1.3 0.2 setosa 5.8
# ...
One workaround is to supply the names via the `columns` argument:
iris_tbl %>%
  spark_apply(function(e){
    library(dplyr)
    e %>% mutate(slm = median(Sepal_Length))
  }, columns = c(colnames(iris), "slm"))
## Source: table<sparklyr_tmp_60a4126692e7> [?? x 6]
## Database: spark_connection
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species slm
# <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
# 1 5.1 3.5 1.4 0.2 setosa 5.8
# 2 4.9 3.0 1.4 0.2 setosa 5.8
# 3 4.7 3.2 1.3 0.2 setosa 5.8
# ...
Is this a bug?
Here is the `sessionInfo()`:
Oracle Distribution of R version 3.3.0 (--)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Oracle Linux Server 7.2
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] bindrcpp_0.2 tidyr_0.7.2 dbplot_0.2.0 rlang_0.1.4
[5] anytime_0.3.0 jsonlite_1.5 magrittr_1.5 ggplot2_2.2.1
[9] DBI_0.7 dtplyr_0.0.2 dplyr_0.7.4 kudusparklyr_0.1.0
[13] sparklyr_0.7.0 data.table_1.10.4-3
loaded via a namespace (and not attached):
[1] Rcpp_0.12.14 dbplyr_1.1.0 plyr_1.8.4 bindr_0.1
[5] base64enc_0.1-3 tools_3.3.0 digest_0.6.12 gtable_0.2.0
[9] tibble_1.3.4 nlme_3.1-127 lattice_0.20-33 pkgconfig_2.0.1
[13] psych_1.7.8 shiny_1.0.5 rstudioapi_0.7 yaml_2.1.16
[17] parallel_3.3.0 withr_2.1.0 httr_1.3.1 stringr_1.2.0
[21] rprojroot_1.2 grid_3.3.0 glue_1.2.0 R6_2.2.2
[25] foreign_0.8-66 purrr_0.2.4 reshape2_1.4.2 scales_0.5.0
[29] backports_1.1.1 htmltools_0.3.6 assertthat_0.2.0 mnormt_1.5-5
[33] RApiDatetime_0.0.3 colorspace_1.3-2 mime_0.5 xtable_1.8-2
[37] httpuv_1.3.5 config_0.2 stringi_1.1.6 openssl_0.9.9
[41] munsell_0.4.3 lazyeval_0.2.1 broom_0.4.3
I know this is an old R version, but that is not up to me...
That is how it is designed. The linked documentation states:

"By default, spark_apply() derives the column names from the input Spark data frame. Use the names argument to rename or add new columns."
trees_tbl %>%
  spark_apply(
    function(e) data.frame(2.54 * e$Girth, e),
    names = c("Girth(cm)", colnames(trees)))