
Sparklyr's spark_apply function seems to run on single executor and fails on moderately-large dataset

I am trying to use spark_apply to run the R function below on a Spark table. This works fine if my input table is small (e.g. 5,000 rows), but when the table is moderately large (e.g. 5,000,000 rows) it fails after ~30 mins with the error: sparklyr worker rscript failure, check worker logs for details

Looking at the Spark UI shows that there is only ever a single task being created, and a single executor being applied to this task.

Can anyone give advice on why this function is failing for the 5 million row dataset? Could the problem be that a single executor is being made to do all the work, and failing?

library(sparklyr)
library(dplyr)
# sc is an existing spark_connect() connection

# Create data and copy to Spark
testdf <- data.frame(string_id=rep(letters[1:5], times=1000),   # 5000 row table
                     string_categories=rep(c("", "1", "2 3", "4 5 6", "7"), times=1000))
testtbl <- sdf_copy_to(sc, testdf, overwrite=TRUE, repartition=100L, memory=TRUE)

# Write function to return dataframe with strings split out
myFunction <- function(inputdf){
  inputdf$string_categories <- as.character(inputdf$string_categories)
  inputdf$string_categories <- with(inputdf, ifelse(string_categories=="", "blank", string_categories))
  stringCategoriesList <- strsplit(inputdf$string_categories, ' ')
  outDF <- data.frame(string_id=rep(inputdf$string_id, times=unlist(lapply(stringCategoriesList, length))),
                      string_categories=unlist(stringCategoriesList))
  return(outDF)
}

# Use spark_apply to run function in Spark
outtbl <- testtbl %>%
  spark_apply(myFunction,
              names=c('string_id', 'string_categories'))
outtbl
  1. The sparklyr worker rscript failure, check worker logs for details error is written by the driver node and points out that the actual error needs to be found in the worker logs. Usually, the worker logs can be accessed by opening stdout from the executor's tab in the Spark UI; the logs should contain RScript: entries describing what the executor is processing and the specifics of the error. A quick way to skim these entries from R is sketched below.
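
  If the Spark UI is hard to reach, sparklyr's spark_log() can sometimes surface the same RScript: lines from the driver/local log; on a real cluster the executor stdout in the UI remains the authoritative place to look. A minimal sketch, assuming sc is the existing connection from the question:

    spark_log(sc, n = 200, filter = "RScript")   # show recent log entries mentioning RScript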

  2. Regarding the single task being created: when columns are not specified with types in spark_apply(), it needs to compute a subset of the result to guess the column types. To avoid this, pass explicit column types as follows:

    outtbl <- testtbl %>%
      spark_apply(myFunction,
                  columns = list(string_id = "character",
                                 string_categories = "character"))

  3. If using sparklyr 0.6.3, update to sparklyr 0.6.4 or devtools::install_github("rstudio/sparklyr"), since sparklyr 0.6.3 contains an incorrect wait time in some edge cases where package distribution is enabled and more than one executor runs in each node. See the snippet below for one way to check and update.
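
  For example, the installed version can be checked and updated like this (devtools is only needed for the development build from GitHub):

    packageVersion("sparklyr")                      # confirm whether 0.6.3 is installed
    install.packages("sparklyr")                    # CRAN release (0.6.4 or later)
    # devtools::install_github("rstudio/sparklyr")  # or the development version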

  4. Under high load, it is common to run out of memory. Increasing the number of partitions could resolve this issue since it would reduce the total memory required to process this dataset. Try running this as:

    testtbl %>% sdf_repartition(1000) %>% spark_apply(myFunction, names=c('string_id', 'string_categories'))

  5. It could also be the case that the function throws an exception for some of the partitions due to logic in the function. You could see if this is the case by using tryCatch() to ignore the errors, and then find which values are missing and why the function would fail for those values. I would start with something like:

    myFunction <- function(inputdf){
      tryCatch({
        inputdf$string_categories <- as.character(inputdf$string_categories)
        inputdf$string_categories <- with(inputdf, ifelse(string_categories=="", "blank", string_categories))
        stringCategoriesList <- strsplit(inputdf$string_categories, ' ')
        outDF <- data.frame(string_id=rep(inputdf$string_id, times=unlist(lapply(stringCategoriesList, length))),
                            string_categories=unlist(stringCategoriesList))
        return(outDF)
      }, error = function(e) {
        return(data.frame(string_id = c(0), string_categories = c("error")))
      })
    }
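
  Putting points 2 and 4 together (explicit column types plus more partitions), a minimal sketch of the full call, reusing only the names from the question, would be:

    outtbl <- testtbl %>%
      sdf_repartition(1000) %>%
      spark_apply(myFunction,
                  columns = list(string_id = "character",
                                 string_categories = "character"))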
