
Failed to execute user defined function on a dataframe in Spark (Scala)

I have a dataframe df like the following:

+--------+--------------------+--------+------+
|      id|                path|somestff| hash1|
+--------+--------------------+--------+------+
|       1|/file/dirA/fileA.txt|      58| 65161|
|       2|/file/dirB/fileB.txt|      52| 65913|
|       3|/file/dirC/fileC.txt|      99|131073|
|       4|/file/dirF/fileD.txt|      46|196233|
+--------+--------------------+--------+------+

One note: the /file/dir parts differ. Not all files are stored in the same directory; in fact, there are hundreds of files in various directories.

What I want to accomplish here is to read the file referenced in the path column, count the records within that file, and write the resulting row count into a new column of the dataframe.

I tried the following function and UDF:

def executeRowCount(fileCount: String): Long = {
  val rowCount = spark.read.format("csv").option("header", "false").load(fileCount).count
  rowCount
}

val execUdf = udf(executeRowCount _)

df.withColumn("row_count", execUdf(col("path"))).show()

This results in the following error:

org.apache.spark.SparkException: Failed to execute user defined function($anonfun$1: (string) => bigint)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:109)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
        at $line39.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:28)
        at $line39.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:25)
        ... 19 more

I tried iterating through the column after collecting it, like this:

val te = df.select("path").as[String].collect()
te.foreach(executeRowCount)

and here it works just fine, but I want to store the result within the df...

I've tried several solutions, but I'm facing a dead end here.

That does not work because data frames can only be created in the driver JVM, while the UDF code runs in the executor JVMs. What you can do instead is load the CSVs into a separate data frame and enrich the data with a filename column:

import org.apache.spark.sql.functions.input_file_name

val csvs = spark
  .read
  .format("csv")
  .load("/file/dir/")
  .withColumn("filename", input_file_name())

and then join the original df on the filename column.
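
For completeness, a rough sketch of that last step could look like the following. The names counts and withCounts, the left join, and the path normalization are illustrative assumptions: input_file_name() usually yields a full URI (e.g. with a file:// prefix), so the regex may need adjusting for your file system.

// Sketch only: derive a per-file row count from `csvs`, then join it back onto `df`.
// Assumption: input_file_name() returns a URI like "file:///file/dirA/fileA.txt",
// which is normalized here to match the plain paths stored in the `path` column.
import org.apache.spark.sql.functions.{col, count, regexp_replace}

val counts = csvs
  .groupBy("filename")
  .agg(count("filename").as("row_count"))
  .withColumn("path", regexp_replace(col("filename"), "^file:/*", "/"))

val withCounts = df.join(counts.select("path", "row_count"), Seq("path"), "left")
withCounts.show()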

I fixed this issue in the following way:

val queue = df.select("path").as[String].collect()
val countResult = for (item <- queue) yield {
  (item, spark.read.format("csv").option("header", "false").load(item).count)
}

val df2 = spark.createDataFrame(countResult)

Afterwards I joined the df with df2, as sketched below...
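
A minimal sketch of that join, assuming the tuple columns of df2 are renamed first (the names df2Named and joined, and the left join, are illustrative):

// Sketch only: name the tuple columns of df2, then join back onto df by path.
val df2Named = df2.toDF("path", "row_count")
val joined = df.join(df2Named, Seq("path"), "left")
joined.show()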

The problem here is, as @ollik1 mentioned, the driver/worker architecture of UDFs: the UDF is not serializable because of the spark.read call I would need inside it.

What about this?

def executeRowCount = udf((fileCount: String) => {
  spark.read.format("csv").option("header", "false").load(fileCount).count
})

df.withColumn("row_count", executeRowCount(col("path"))).show()

Maybe something like this?

  import org.apache.spark.sql.functions.{count, input_file_name}

  sqlContext
    .read
    .format("csv")
    .load("/tmp/input/")
    .withColumn("filename", input_file_name())
    .groupBy("filename")
    .agg(count("filename").as("record_count"))
    .show()
