
Getting data from an avro file for every row in a spark dataframe column

I am trying to process a column in a dataframe and, for each entry, retrieve a metric from the corresponding avro file.

Essentially, I want to do the following:

  1. Read each row of the Path column, which holds the path to an avro file
  2. Read that avro file in as a dataframe and extract the accuracy metric, which is stored as a Struct
  3. Create a new column called Accuracy containing the accuracy metric

This can also be thought of as applying spark.read.format("com.databricks.spark.avro").load(avro_path), but for every row in the Path column. Here is my input dataframe:

+----------+-----+--------------------------+
|timestamp |Model|         Path             |
+----------+-----+--------------------------+
|11:02     |Vgg  |projects/Vgg/results.avro |
|18:31     |Dnet |projects/Dnet/results.avro|
|15:54     |Rnet |projects/Rnet/results.avro|
|12:19     |ViT  |projects/ViT/results.avro |
+----------+-----+--------------------------+

And this is the output dataframe I want:

+----------+-----+--------------------------+-----------+
|timestamp |Model|         Path             | Accuracy  |
+----------+-----+--------------------------+-----------+
|11:02     |Vgg  |projects/Vgg/results.avro |   0.72    | 
|18:31     |Dnet |projects/Dnet/results.avro|   0.78    |
|15:54     |Rnet |projects/Rnet/results.avro|   0.75    |
|12:19     |ViT  |projects/ViT/results.avro |   0.82    |
+----------+-----+--------------------------+-----------+
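For a single file, reading the metric works fine on the driver. A minimal sketch (the avro format name and the metrics.Accuracy field come from the question; the concrete path and the assumption that Accuracy is a double are illustrative):

    // Read one avro file on the driver and pull out the accuracy metric.
    // Assumes the file has a top-level "metrics" struct with an "Accuracy" field.
    val resultsDf = spark.read
      .format("com.databricks.spark.avro")
      .load("projects/Vgg/results.avro")

    // first() runs on the driver, so this works for a single file
    val accuracy = resultsDf.select("metrics.Accuracy").first.getDouble(0)

The difficulty is doing this per row of the dataframe, since spark.read is only available on the driver.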

I tried using a udf, but I gather you cannot load a dataframe inside a udf.

val get_auc: (String => String) = (avro_path: String) => {
    val auc_avro_file = spark.read.format("com.databricks.spark.avro").load(avro_path)
    val auc = auc_avro_file.select("metrics.Accuracy").first.toString
    auc
}
val auc_udf = udf(get_auc)
val auc_path = models_df.withColumn("Accuracy", auc_udf(col("Path")))

Error:

Caused by: org.apache.spark.SparkException: Failed to execute user defined function($anonfun$1: (string) => string)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.writeFields_0_1$(Unknown Source)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:254)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
  at org.apache.spark.scheduler.Task.run(Task.scala:109)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:429)
  ... 3 more
Caused by: java.lang.NullPointerException
  at $anonfun$1.apply(<console>:49)
  at $anonfun$1.apply(<console>:46)
  ... 20 more

Is there another way I can do this, for example using map or a for loop?
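The NullPointerException above is the expected symptom: the SparkSession exists only on the driver, so inside the udf (which runs on executors) spark is null. When the list of paths is small, one workaround is to collect the paths to the driver, read each file there, and join the results back. A sketch under those assumptions (field types and variable names other than models_df and Path are illustrative):

    import org.apache.spark.sql.functions._

    // Collect the distinct paths to the driver (fine when the path list is small)
    val paths = models_df.select("Path").distinct.collect.map(_.getString(0))

    // Read each avro file on the driver and build (path, accuracy) pairs
    val accRows = paths.map { p =>
      val acc = spark.read.format("com.databricks.spark.avro")
        .load(p)
        .select("metrics.Accuracy")
        .first.getDouble(0)   // assumes Accuracy is stored as a double
      (p, acc)
    }

    // Join the metric back onto the original dataframe
    val accDf = spark.createDataFrame(accRows.toSeq).toDF("Path", "Accuracy")
    val withAcc = models_df.join(accDf, Seq("Path"), "left")

This loops over files on the driver, so it does not scale to thousands of paths, but it avoids touching spark.read from executor code.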

Edit: tried using input_file_name as suggested in one of the answers below:

val paths_col = auc_path.select($"Path")  
val avro_paths = paths_col.withColumn("filename", input_file_name()) 

But this gives me the url of a completely different avro file in the new column, which is not what I want.

+----------+-----+--------------------------+------------------------------------+
|timestamp |Model|         Path             |      different_output_Path         |
+----------+-----+--------------------------+------------------------------------+
|11:02     |Vgg  |projects/Vgg/results.avro |projects/models/all_model_runs.avro |
|18:31     |Dnet |projects/Dnet/results.avro|projects/models/all_model_runs.avro |
|15:54     |Rnet |projects/Rnet/results.avro|projects/models/all_model_runs.avro |
|12:19     |ViT  |projects/ViT/results.avro |projects/models/all_model_runs.avro |
+----------+-----+--------------------------+------------------------------------+

How can I still get the metrics.Accuracy part of each avro file?

Read the avro files in as a dataframe and store each row's source path:

...
val dfWithCol = df.withColumn("filename",input_file_name())
...

Then join appropriately.
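One way this might look end to end: read every results file in a single pass (so input_file_name() tags each row with the file it actually came from, rather than the file backing models_df), then join on the path. A sketch, assuming the avro files live under a glob like projects/*/results.avro and Accuracy sits under a metrics struct:

    import org.apache.spark.sql.functions.{col, input_file_name}

    // Read all the per-model avro files at once; input_file_name() records
    // which file each row came from, giving us a join key.
    val metricsDf = spark.read
      .format("com.databricks.spark.avro")
      .load("projects/*/results.avro")           // assumed glob over the model dirs
      .withColumn("filename", input_file_name())
      .select(col("filename"), col("metrics.Accuracy").as("Accuracy"))

    // input_file_name() returns a full URI (e.g. "file:///..." or "hdfs://..."),
    // so match on the relative Path as a suffix rather than exact equality.
    val joined = models_df.join(
      metricsDf,
      metricsDf("filename").contains(models_df("Path")),
      "left"
    ).drop("filename")

The earlier attempt tagged rows of the wrong dataframe, which is why it returned a single unrelated file for every row; the key point is to call input_file_name() on the dataframe read from the avro files themselves.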

