Getting data from an avro file for every row in a spark dataframe column
I am trying to process a column in a dataframe and retrieve a metric from the avro file corresponding to each entry.
Essentially, I want to apply spark.read.format("com.databricks.spark.avro").load(avro_path) for every row in the Path column. This is my input dataframe:
+----------+-----+--------------------------+
|timestamp |Model| Path |
+----------+-----+--------------------------+
|11:02 |Vgg |projects/Vgg/results.avro |
|18:31 |Dnet |projects/Dnet/results.avro|
|15:54 |Rnet |projects/Rnet/results.avro|
|12:19 |ViT |projects/ViT/results.avro |
+----------+-----+--------------------------+
I would like this to be my output dataframe:
+----------+-----+--------------------------+-----------+
|timestamp |Model| Path | Accuracy |
+----------+-----+--------------------------+-----------+
|11:02 |Vgg |projects/Vgg/results.avro | 0.72 |
|18:31 |Dnet |projects/Dnet/results.avro| 0.78 |
|15:54 |Rnet |projects/Rnet/results.avro| 0.75 |
|12:19 |ViT |projects/ViT/results.avro | 0.82 |
+----------+-----+--------------------------+-----------+
I tried using a udf, but I guess you can't load a dataframe inside a udf.
// For each path, read the avro file and pull out the accuracy metric.
val get_auc: (String => String) = (avro_path: String) => {
  val auc_avro_file = spark.read.format("com.databricks.spark.avro").load(avro_path)
  val auc = auc_avro_file.select("metrics.Accuracy").first.toString
  auc
}
val auc_udf = udf(get_auc)
// Apply the udf to the Path column — this fails with the error below.
val auc_path = models_df.withColumn("Accuracy", auc_udf(col("Path")))
Error:
Caused by: org.apache.spark.SparkException: Failed to execute user defined function($anonfun$1: (string) => string)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.writeFields_0_1$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:254)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:429)
... 3 more
Caused by: java.lang.NullPointerException
at $anonfun$1.apply(<console>:49)
at $anonfun$1.apply(<console>:46)
... 20 more
Is there another way I can do this, such as using map or a for loop?
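The NullPointerException above happens because the SparkSession is not available inside a udf running on executors, so spark.read cannot be called there. A minimal sketch of a driver-side loop instead (assuming models_df and the metrics.Accuracy field from the question; the exact accessor types are assumptions):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

// Collect the distinct avro paths to the driver (fine while the list is small).
val paths = models_df.select("Path").distinct.collect.map(_.getString(0))

// Read each file on the driver, where spark.read is available, and extract
// its accuracy. Assumes metrics.Accuracy is a double; adjust the getter if not.
val accuracies = paths.map { p =>
  val metrics = spark.read.format("com.databricks.spark.avro").load(p)
  (p, metrics.select("metrics.Accuracy").first.getDouble(0))
}

// Turn the (path, accuracy) pairs into a small dataframe and join it back.
val accDf = accuracies.toSeq.toDF("Path", "Accuracy")
val result = models_df.join(accDf, Seq("Path"), "left")
```

This trades parallelism for correctness: the files are read sequentially on the driver, which is acceptable for a handful of small metrics files but not for thousands.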
Edit: tried using input_file_name per one of the answers below:
val paths_col = auc_path.select($"Path")
val avro_paths = paths_col.withColumn("filename", input_file_name())
But this puts a url to an entirely different avro file into the new column, which is not what I want.
+----------+-----+--------------------------+------------------------------------+
|timestamp |Model| Path                     | different_output_Path              |
+----------+-----+--------------------------+------------------------------------+
|11:02     |Vgg  |projects/Vgg/results.avro |projects/models/all_model_runs.avro |
|18:31     |Dnet |projects/Dnet/results.avro|projects/models/all_model_runs.avro |
|15:54     |Rnet |projects/Rnet/results.avro|projects/models/all_model_runs.avro |
|12:19     |ViT  |projects/ViT/results.avro |projects/models/all_model_runs.avro |
+----------+-----+--------------------------+------------------------------------+
How can I still get the metrics.Accuracy part from each avro file?
Read the avro files as a dataframe and store each row's path:
...
val dfWithCol = df.withColumn("filename",input_file_name())
...
and then join appropriately.
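A sketch of that join, assuming the per-model avro files share a schema and can be matched with a glob (the "projects/*/results.avro" pattern and column names here are illustrative, not from the question):

```scala
import org.apache.spark.sql.functions.{col, input_file_name}

// Load all the per-model avro files in one read; input_file_name() tags each
// row with the file it came from.
val metricsDf = spark.read
  .format("com.databricks.spark.avro")
  .load("projects/*/results.avro")
  .withColumn("filename", input_file_name())
  .select(col("filename"), col("metrics.Accuracy").as("Accuracy"))

// input_file_name() returns a full URI (e.g. file:/... or hdfs://...), so a
// plain equality join on Path would miss; match on the path suffix instead.
val result = models_df.join(
  metricsDf,
  metricsDf("filename").endsWith(models_df("Path")),
  "left"
).drop("filename")
```

Unlike the udf attempt, every spark.read call here happens on the driver when the plan is built, so nothing tries to use the SparkSession on an executor.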