Unable to execute a user-defined function on a dataframe in Spark (Scala)

Question

I have a dataframe df like the following:

+--------+--------------------+--------+------+
|      id|                path|somestff| hash1|
+--------+--------------------+--------+------+
|       1|/file/dirA/fileA.txt|      58| 65161|
|       2|/file/dirB/fileB.txt|      52| 65913|
|       3|/file/dirC/fileC.txt|      99|131073|
|       4|/file/dirF/fileD.txt|      46|196233|
+--------+--------------------+--------+------+

One note: the /file/dir part differs. Not all files are stored in the same directory; in fact, there are hundreds of files in various directories.

What I want to accomplish is to read each file named in the path column, count the records in it, and write the resulting row count into a new column of the dataframe.

I tried the following function and UDF:

def executeRowCount(fileCount: String): Long = {
  val rowCount = spark.read.format("csv").option("header", "false").load(fileCount).count
  rowCount
}

val execUdf = udf(executeRowCount _)

df.withColumn("row_count", execUdf (col("path"))).show()

This results in the following error:

org.apache.spark.SparkException: Failed to execute user defined function($anonfun$1: (string) => bigint)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:109)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
        at $line39.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:28)
        at $line39.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:25)
        ... 19 more

I also tried iterating over the column after collecting it, like this:

val te = df.select("path").as[String].collect()
te.foreach(executeRowCount)

and here it works just fine, but I want to store the result within the df.

I've tried several solutions, but I've hit a dead end here.

Answer 1

That does not work because data frames can only be created in the driver JVM, while the UDF code runs in the executor JVMs. What you can do is load the CSVs into a separate data frame and enrich the data with a file name column:

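import org.apache.spark.sql.functions.input_file_name

// Load all CSVs at once and tag every record with the file it came from.
val csvs = spark
  .read
  .format("csv")
  .load("/file/dir/")
  .withColumn("filename", input_file_name())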

and then join the original df on the filename column.
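
A minimal sketch of that join, assuming it runs in spark-shell with the `spark` session and the `df` from the question. The glob `/file/*/*.txt` is a hypothetical pattern standing in for the real directory layout, and the names `counts` and `row_count` are chosen here for illustration. `input_file_name()` typically returns a full URI (for example one starting with `file://` or `hdfs://`), so the sketch strips the scheme and authority before matching against `path`; that regular expression is an assumption about how the paths line up, not part of the original answer.

```scala
import org.apache.spark.sql.functions.{col, count, input_file_name, lit, regexp_replace}

// Assumption: `spark` and `df` are the session and data frame from the question.
// Hypothetical glob covering the directories under /file; adjust to the real layout.
val counts = spark.read
  .format("csv")
  .option("header", "false")
  .load("/file/*/*.txt")
  .withColumn("filename", input_file_name())
  .groupBy("filename")
  .agg(count(lit(1)).as("row_count"))
  // Drop a leading "scheme:" and optional "//authority" so the value matches df("path").
  .withColumn("path", regexp_replace(col("filename"), "^[a-zA-Z][a-zA-Z0-9+.-]*:(//[^/]*)?", ""))
  .drop("filename")

// A left join keeps rows of df whose file was not read; row_count is null for those.
val result = df.join(counts, Seq("path"), "left")
result.show(false)
```

Aggregating before the join keeps the joined side at one row per file, so the join does not multiply the rows of df.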

Answer 2

I fixed this issue in the following way:

val queue = df.select("path").as[String].collect()
val countResult = for (item <- queue) yield {
    val rowCount = (item, spark.read.format("csv").option("header", "false").load(item).count)
    rowCount
}

val df2 = spark.createDataFrame(countResult)

Afterwards I joined df with df2.

The problem here is, as @ollik1 mentioned, the driver/worker architecture that UDFs run in. The UDF body is executed on the workers, where the spark.read function it needs cannot be used.
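
The join itself is not shown above; it could look roughly like the following. This is a sketch assuming `df` from the question and `df2` from the snippet in this answer, run in the same spark-shell session. The column names `path` and `row_count` passed to `toDF` are chosen here for illustration, since a data frame built from tuples otherwise gets the default names `_1` and `_2`.

```scala
// Assumption: `df` and `df2` exist as defined in the question and in this answer.
// Rename the tuple columns so the join key matches the "path" column of df.
val namedCounts = df2.toDF("path", "row_count")

// The counting already happened on the driver; the join only attaches the results.
val joined = df.join(namedCounts, Seq("path"), "left")
joined.show(false)
```

Since the loop launches one Spark job per collected path from the driver, this is likely to get slow as the number of files grows.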

Answer 3

What about:

def executeRowCount = udf((fileCount: String) => {
  spark.read.format("csv").option("header", "false").load(fileCount).count
})

df.withColumn("row_count", executeRowCount(col("path"))).show()

Answer 4

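Maybe something like this?

import org.apache.spark.sql.functions.{count, input_file_name}

// Read every CSV under the directory and count the records per source file.
sqlContext
  .read
  .format("csv")
  .load("/tmp/input/")
  .withColumn("filename", input_file_name())
  .groupBy("filename")
  .agg(count("filename").as("record_count"))
  .show()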

4 answers

Answer 1
2 2019-04-01 14:51:01

Answer 2
1 accepted 2019-04-04 15:51:08

Answer 3
0 2019-04-01 14:51:12

Answer 4
0 2019-04-01 18:30:22
