Sparklyr cannot reference table in spark_apply
I want to use spark_apply to iterate through a number of data processes for feature generation. To do that I need to reference tables already loaded into Spark, but I get the following error:
ERROR sparklyr: RScript (3076) terminated unexpectedly: object 'ref_table' not found
A reproducible example:
ref_table   <- sdf_along(sc, 10)
apply_table <- sdf_along(sc, 10)

spark_apply(x = apply_table,
            f = function(x) {
              c(x, ref_table)
            })
I know I can reference libraries inside the function, but I'm not sure how to call up the data. I am running a local Spark cluster through RStudio.
Unfortunately the failure is to be expected here.
Apache Spark, and consequently the platforms built on top of it, doesn't support nested transformations like this one. You cannot use nested transformations, distributed objects, or the Spark context (spark_connection in the case of sparklyr) from worker code.
For a detailed explanation please check my answer to Is there a reason not to use SparkContext.getOrCreate when writing a spark job?.
Your question doesn't give enough context to determine the best course of action here, but in general there are two possible solutions: reformulate your problem as a join, or as a Cartesian product (Spark's crossJoin).