
Sparklyr cannot reference table in spark_apply

I want to use spark_apply to iterate through a number of data processes for feature generation. To do that I need to reference tables already loaded into Spark, but I get the following error:

ERROR sparklyr: RScript (3076) terminated unexpectedly: object 'ref_table' not found

A reproducible example:

library(sparklyr)

sc <- spark_connect(master = "local")

ref_table   <- sdf_along(sc, 10)
apply_table <- sdf_along(sc, 10)

# Fails: the worker closure cannot resolve the Spark table `ref_table`
spark_apply(x = apply_table,
            f = function(x) {
              c(x, ref_table)
            })

I know I can reference libraries inside the function, but I'm not sure how to call up the data. I am running a local Spark cluster through RStudio.

Unfortunately, the failure is to be expected here.

Apache Spark, and consequently the platforms built on top of it, doesn't support nested transformations like this one. You cannot use nested transformations, distributed objects, or the Spark context (spark_connection in the case of sparklyr) from worker code.

For a detailed explanation, please check my answer to Is there a reason not to use SparkContext.getOrCreate when writing a spark job?

Your question doesn't give enough context to determine the best course of action here, but in general there are two possible solutions:

  • As long as one of the datasets is small enough to be stored in memory, use it directly in the closure as a plain R object (see the first sketch below).
  • Reformulate your problem as a join or a Cartesian product (Spark's crossJoin; see the second sketch below).
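
A minimal sketch of the first option, assuming the reference data is small enough to collect() to the driver. The collected data frame is then shipped to each worker through spark_apply's context argument; the names ref_local and ref_max are illustrative, not from the original question:

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

apply_table <- sdf_along(sc, 10)

# Collect the small reference table to the driver so it becomes
# a plain R data frame rather than a Spark table
ref_local <- sdf_along(sc, 10) %>% collect()

spark_apply(
  x = apply_table,
  f = function(df, ctx) {
    # On the worker, `ctx` is an ordinary R data frame
    df$ref_max <- max(ctx$id)
    df
  },
  context = ref_local  # serialized and passed to each call of `f`
)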
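
And a sketch of the second option, which keeps everything inside Spark. Instead of calling crossJoin through the lower-level invoke API, a constant dummy key turns an ordinary inner_join into a Cartesian product that sparklyr translates to Spark SQL; the dummy column name and the suffixes are arbitrary choices here:

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

ref_table   <- sdf_along(sc, 10)
apply_table <- sdf_along(sc, 10)

# Joining on a constant key pairs every row of one table with every
# row of the other; `suffix` disambiguates the shared `id` column
crossed <- apply_table %>%
  mutate(dummy = 1) %>%
  inner_join(ref_table %>% mutate(dummy = 1),
             by = "dummy", suffix = c("_apply", "_ref")) %>%
  select(-dummy)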
