colnames in `sparklyr::spark_apply()` using `dplyr::mutate()`
Assuming `sc` is an existing spark(lyr) connection, the name given in `dplyr::mutate()` is ignored:
iris_tbl <- sdf_copy_to(sc, iris)
iris_tbl %>%
  spark_apply(function(e){
    library(dplyr)
    e %>% mutate(slm = median(Sepal_Length))
  })
## Source: table<sparklyr_tmp_60a41ac01b4e> [?? x 6]
## Database: spark_connection
# Sepal_Length Sepal_Width Petal_Length Petal_Width Species X6
# <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
# 1 5.1 3.5 1.4 0.2 setosa 5.8
# 2 4.9 3.0 1.4 0.2 setosa 5.8
# 3 4.7 3.2 1.3 0.2 setosa 5.8
# ...
One workaround is to supply the names via the `columns` argument:
iris_tbl %>%
  spark_apply(function(e){
    library(dplyr)
    e %>% mutate(slm = median(Sepal_Length))
  }, columns = c(colnames(iris), "slm"))
## Source: table<sparklyr_tmp_60a4126692e7> [?? x 6]
## Database: spark_connection
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species slm
# <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
# 1 5.1 3.5 1.4 0.2 setosa 5.8
# 2 4.9 3.0 1.4 0.2 setosa 5.8
# 3 4.7 3.2 1.3 0.2 setosa 5.8
# ...
Is this a bug?
Here is the `sessionInfo()`:
Oracle Distribution of R version 3.3.0 (--)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Oracle Linux Server 7.2
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] bindrcpp_0.2 tidyr_0.7.2 dbplot_0.2.0 rlang_0.1.4
[5] anytime_0.3.0 jsonlite_1.5 magrittr_1.5 ggplot2_2.2.1
[9] DBI_0.7 dtplyr_0.0.2 dplyr_0.7.4 kudusparklyr_0.1.0
[13] sparklyr_0.7.0 data.table_1.10.4-3
loaded via a namespace (and not attached):
[1] Rcpp_0.12.14 dbplyr_1.1.0 plyr_1.8.4 bindr_0.1
[5] base64enc_0.1-3 tools_3.3.0 digest_0.6.12 gtable_0.2.0
[9] tibble_1.3.4 nlme_3.1-127 lattice_0.20-33 pkgconfig_2.0.1
[13] psych_1.7.8 shiny_1.0.5 rstudioapi_0.7 yaml_2.1.16
[17] parallel_3.3.0 withr_2.1.0 httr_1.3.1 stringr_1.2.0
[21] rprojroot_1.2 grid_3.3.0 glue_1.2.0 R6_2.2.2
[25] foreign_0.8-66 purrr_0.2.4 reshape2_1.4.2 scales_0.5.0
[29] backports_1.1.1 htmltools_0.3.6 assertthat_0.2.0 mnormt_1.5-5
[33] RApiDatetime_0.0.3 colorspace_1.3-2 mime_0.5 xtable_1.8-2
[37] httpuv_1.3.5 config_0.2 stringi_1.1.6 openssl_0.9.9
[41] munsell_0.4.3 lazyeval_0.2.1 broom_0.4.3
I know this is an old R version, but that is not up to me...
That is how it is designed. The linked documentation states:

"By default, spark_apply() derives the column names from the input Spark data frame. Use the names argument to rename or add new columns."
trees_tbl %>%
  spark_apply(
    function(e) data.frame(2.54 * e$Girth, e),
    names = c("Girth(cm)", colnames(trees)))