
Writing a function to use with spark_apply() from sparklyr

test <- data.frame(
  prod_id   = c("shoe", "shoe", "shoe", "shoe", "shoe", "shoe",
                "boat", "boat", "boat", "boat", "boat", "boat"),
  seller_id = c("a", "b", "c", "d", "e", "f", "a", "g", "h", "r", "q", "b"),
  Dich      = c(1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0),
  price     = c(120, 20, 10, 4, 3, 4, 30, 43, 56, 88, 75, 44)
)
test

       prod_id seller_id Dich price
 1     shoe         a    1   120
 2     shoe         b    0    20
 3     shoe         c    0    10
 4     shoe         d    0     4
 5     shoe         e    0     3
 6     shoe         f    0     4
 7     boat         a    0    30
 8     boat         g    0    43
 9     boat         h    1    56
10     boat         r    0    88
11     boat         q    0    75
12     boat         b    0    44

I want to create a new column holding the difference between each observation's price and the price of the observation where Dich == 1, within each prod_id group. The syntax for doing that is below.

test %>%
  group_by(prod_id) %>%
  mutate(diff_p = if (any(Dich == 1)) price - price[Dich == 1] else NA)

       prod_id seller_id Dich price diff_p
 1     shoe         a    1   120      0
 2     shoe         b    0    20     -100
 3     shoe         c    0    10     -110
 4     shoe         d    0     4     -116
 5     shoe         e    0     3     -117
 6     shoe         f    0     4     -116
 7     boat         a    0    30     -26
 8     boat         g    0    43     -13
 9     boat         h    1    56       0
10     boat         r    0    88      32
11     boat         q    0    75      19
12     boat         b    0    44     -12
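The if (any(Dich == 1)) ... else NA guard matters: in a group with no Dich == 1 row, price[Dich == 1] would be a zero-length vector, and mutate() would error because the result does not match the group size. A minimal sketch of the guard handling such a group (the "hat" group is hypothetical, invented for illustration; assumes dplyr is installed):

```r
library(dplyr)

# A hypothetical group with no Dich == 1 row
test2 <- data.frame(
  prod_id = c("hat", "hat"),
  Dich    = c(0, 0),
  price   = c(10, 15)
)

res <- test2 %>%
  group_by(prod_id) %>%
  mutate(diff_p = if (any(Dich == 1)) price - price[Dich == 1] else NA) %>%
  ungroup()

res$diff_p
# [1] NA NA
```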

Now I would like to wrap the same syntax in a function, so that I can apply it to a new data frame and get the same results with sparklyr::spark_apply().

trans <- function(e) {
  e %>%
    group_by(prod_id) %>%
    mutate(diff_p = if (any(Dich == 1)) price - price[Dich == 1] else NA)
}

On their website, RStudio discusses applying R functions to Spark objects:

https://spark.rstudio.com/guides/distributed-r/

Here is an example of a function that scales all of the columns of a Spark data frame.

trees_tbl %>%
  spark_apply(function(e) scale(e))

I'm wondering how I might write the function above in the format required by spark_apply(). It would also be helpful if you could explain how to include e in a function - what does e stand in for?

All of the packages need to be available on the workers, and every function your code calls needs to be found there (in particular, %>% requires library(magrittr) inside the function so the worker loads it). One approach that works is:

trans <- function(e) {
  # %>% comes from magrittr, so load it on the worker
  library(magrittr)

  e %>%
    dplyr::group_by(prod_id) %>%
    dplyr::mutate(diff_p = if (any(Dich == 1)) price - price[Dich == 1] else NA)
}

sparklyr::spark_apply(
  x = test_sf, 
  f = trans)
# Source: spark<?> [?? x 5]
   prod_id seller_id  Dich price diff_p
   <chr>   <chr>     <dbl> <dbl>  <dbl>
 1 shoe    a             1   120      0
 2 shoe    b             0    20   -100
 3 shoe    c             0    10   -110
 4 shoe    d             0     4   -116
 5 shoe    e             0     3   -117
 6 shoe    f             0     4   -116
 7 boat    a             0    30    -26
 8 boat    g             0    43    -13
 9 boat    h             1    56      0
10 boat    r             0    88     32
# … with more rows
# ℹ Use `print(n = ...)` to see more rows
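As for what e stands in for: spark_apply() hands each partition of the Spark data frame to f as a plain R data.frame, so you can sanity-check trans() locally before shipping it to the cluster. A sketch of that check, which should reproduce the diff_p column shown above (assumes dplyr and magrittr are installed):

```r
# Local copy of the example data from the question
test <- data.frame(
  prod_id   = rep(c("shoe", "boat"), each = 6),
  seller_id = c("a", "b", "c", "d", "e", "f", "a", "g", "h", "r", "q", "b"),
  Dich      = c(1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0),
  price     = c(120, 20, 10, 4, 3, 4, 30, 43, 56, 88, 75, 44)
)

trans <- function(e) {
  library(magrittr)
  e %>%
    dplyr::group_by(prod_id) %>%
    dplyr::mutate(diff_p = if (any(Dich == 1)) price - price[Dich == 1] else NA)
}

# spark_apply() calls trans() on each partition as a data.frame;
# calling it directly on the local data.frame mimics a single partition
out <- trans(test)
out$diff_p
# [1]    0 -100 -110 -116 -117 -116  -26  -13    0   32   19  -12
```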
