colnames in `sparklyr::spark_apply()` using `dplyr::mutate()`

Assuming sc is an existing spark(lyr) connection, the names given in dplyr::mutate() are ignored:

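library(sparklyr)
library(dplyr)

# Copy iris to Spark; dots in the column names become underscores
iris_tbl <- sdf_copy_to(sc, iris)
iris_tbl %>%
  spark_apply(function(e) {
    library(dplyr)  # dplyr must also be loaded on the worker
    e %>% mutate(slm = median(Sepal_Length))
  })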

## Source:   table<sparklyr_tmp_60a41ac01b4e> [?? x 6]
## Database: spark_connection
#   Sepal_Length Sepal_Width Petal_Length Petal_Width Species    X6
#          <dbl>       <dbl>        <dbl>       <dbl>   <chr> <dbl>
# 1          5.1         3.5          1.4         0.2  setosa   5.8
# 2          4.9         3.0          1.4         0.2  setosa   5.8
# 3          4.7         3.2          1.3         0.2  setosa   5.8
# ...

A workaround would be to provide the names using the columns argument:

iris_tbl %>% 
  spark_apply(function(e){
    library(dplyr)
    e %>% mutate(slm = median(Sepal_Length))
  }, columns = c(colnames(iris), "slm"))

## Source:   table<sparklyr_tmp_60a4126692e7> [?? x 6]
## Database: spark_connection
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species   slm
#          <dbl>       <dbl>        <dbl>       <dbl>   <chr> <dbl>
# 1          5.1         3.5          1.4         0.2  setosa   5.8
# 2          4.9         3.0          1.4         0.2  setosa   5.8
# 3          4.7         3.2          1.3         0.2  setosa   5.8
# ...
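
To avoid hard-coding the names, one option is to run the same function on a small local sample and reuse the resulting column names. This is only a sketch: `add_median`, `local_sample` and `out_names` are hypothetical helper names, and it assumes the worker sees the underscore-style names shown in the first output above.

```r
library(dplyr)

# The function that would be passed to spark_apply()
add_median <- function(e) {
  e %>% mutate(slm = median(Sepal_Length))
}

# Mimic the Spark-side column names locally (dots replaced by underscores)
local_sample <- head(iris)
names(local_sample) <- gsub(".", "_", names(local_sample), fixed = TRUE)

# Derive the output column names from a local dry run
out_names <- colnames(add_median(local_sample))

# With an existing connection `sc` and `iris_tbl` as above, this could be used as:
# iris_tbl %>% spark_apply(add_median, columns = out_names)
```

Unlike `c(colnames(iris), "slm")`, this keeps the underscore names instead of reintroducing the dots.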

Is it a bug?

Here is the sessionInfo():

Oracle Distribution of R version 3.3.0  (--)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Oracle Linux Server 7.2

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] bindrcpp_0.2        tidyr_0.7.2         dbplot_0.2.0        rlang_0.1.4        
 [5] anytime_0.3.0       jsonlite_1.5        magrittr_1.5        ggplot2_2.2.1      
 [9] DBI_0.7             dtplyr_0.0.2        dplyr_0.7.4         kudusparklyr_0.1.0 
[13] sparklyr_0.7.0      data.table_1.10.4-3

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.14       dbplyr_1.1.0       plyr_1.8.4         bindr_0.1         
 [5] base64enc_0.1-3    tools_3.3.0        digest_0.6.12      gtable_0.2.0      
 [9] tibble_1.3.4       nlme_3.1-127       lattice_0.20-33    pkgconfig_2.0.1   
[13] psych_1.7.8        shiny_1.0.5        rstudioapi_0.7     yaml_2.1.16       
[17] parallel_3.3.0     withr_2.1.0        httr_1.3.1         stringr_1.2.0     
[21] rprojroot_1.2      grid_3.3.0         glue_1.2.0         R6_2.2.2          
[25] foreign_0.8-66     purrr_0.2.4        reshape2_1.4.2     scales_0.5.0      
[29] backports_1.1.1    htmltools_0.3.6    assertthat_0.2.0   mnormt_1.5-5      
[33] RApiDatetime_0.0.3 colorspace_1.3-2   mime_0.5           xtable_1.8-2      
[37] httpuv_1.3.5       config_0.2         stringi_1.1.6      openssl_0.9.9     
[41] munsell_0.4.3      lazyeval_0.2.1     broom_0.4.3

I know that it's an old R version, but that's not up to me.

That's how it's designed. The sparklyr documentation states:

By default spark_apply() derives the column names from the input Spark data frame. Use the names argument to rename or add new columns.

trees_tbl %>%
  spark_apply(
    function(e) data.frame(2.54 * e$Girth, e),
    names = c("Girth(cm)", colnames(trees))
  )
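
The names vector has to line up, in order and in length, with the columns the function returns. A quick local check on the built-in `trees` data set, with no Spark connection involved, might look like the sketch below; `f`, `res` and `new_names` are illustrative names only.

```r
# Same function as in the spark_apply() call, run locally on the built-in trees data
f <- function(e) data.frame(2.54 * e$Girth, e)
res <- f(trees)

# One name for the new first column, then the original column names
new_names <- c("Girth(cm)", colnames(trees))

# The number of names must match the number of returned columns
stopifnot(ncol(res) == length(new_names))

names(res) <- new_names
head(res, 3)
```

Note that the question passes the same kind of vector through the `columns` argument rather than `names`.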
