sparklyr spark_apply user defined function error

I'm trying to pass a custom R function inside spark_apply, but I keep running into issues and can't figure out what some of the errors mean.

library(sparklyr)
sc <- spark_connect(master = "local")
perf_df <- data.frame(predicted = c(5, 7, 20), 
                       actual = c(4, 6, 40))


perf_tbl <- sdf_copy_to(sc = sc,
                        x = perf_df,
                        name = "perf_table")

#custom function
ndcg <- function(predicted_rank, actual_rank) { 
  # y is a vector of relevance scores
  DCG <- function(y) y[1] + sum(y[-1]/log(2:length(y), base = 2)) 
  DCG(predicted_rank)/DCG(actual_rank) 
} 

#works in R using R data frame
ndcg(perf_df$predicted, perf_df$actual)
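# on this toy data this returns DCG(predicted) / DCG(actual), roughly 0.70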


#does not work
perf_tbl %>%
  spark_apply(function(e) ndcg(e$predicted, e$actual),
              names = "ndcg")

OK, I'm seeing two possible problems.

(1) spark_apply prefers functions that have one parameter: a data frame.

(2) You may need to make a package, depending on how complex the function is.

Let's say you modify ndcg to receive a data frame as the parameter.

ndcg <- function(dataset) {
  predicted_rank <- dataset$predicted
  actual_rank <- dataset$actual
  # y is a vector of relevance scores
  DCG <- function(y) y[1] + sum(y[-1]/log(2:length(y), base = 2))
  DCG(predicted_rank)/DCG(actual_rank)
}

And you put that in a package called ndcg_package.

Now your code will be similar to:

spark_apply(perf_tbl, ndcg, packages = TRUE, names = "ndcg")

Doing this from memory, so there may be a few typos, but it'll get you close.
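If building a package feels like overkill for a one-off helper, an alternative sketch (my own variation on the idea above, not something shown in the original answer) is to define the helper inside the closure you pass to spark_apply, so the whole body is serialized to the workers along with the call:

# no-package variant: everything the workers need lives inside the closure
perf_tbl %>%
  spark_apply(function(df) {
    DCG <- function(y) y[1] + sum(y[-1]/log(2:length(y), base = 2))
    DCG(df$predicted) / DCG(df$actual)
  },
  names = "ndcg")

This keeps the example self-contained, at the cost of re-defining DCG everywhere you use it.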
