
How to append an index column to a Spark data frame using Spark Streaming in Scala?

I am using something like this:

df.withColumn("idx", monotonically_increasing_id())

But I get an exception, as it is NOT SUPPORTED:

Exception in thread "main" org.apache.spark.sql.AnalysisException: Expression(s): monotonically_increasing_id() is not supported with streaming DataFrames/Datasets;;

at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForStreaming(UnsupportedOperationChecker.scala:143)
at org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:250)
at org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:316)

Any ideas how to add an index or row number column to a Spark Streaming DataFrame in Scala?

Full stack trace: https://justpaste.it/5bdqr

There are a few operations that cannot exist anywhere in a streaming plan of Spark Streaming, unfortunately including monotonically_increasing_id(). To double-check this fact, transformed1 below fails with exactly the error from your question; the check is performed in UnsupportedOperationChecker in the Spark source code:

import org.apache.spark.sql.functions._
import spark.implicits._

// Build a small batch DataFrame, write it out, and read it back as a stream
val df = Seq(("one", 1), ("two", 2)).toDF("foo", "bar")
val schema = df.schema
df.write.parquet("/tmp/out")

val input = spark.readStream.format("parquet").schema(schema).load("/tmp/out")

// Fails: monotonically_increasing_id() is rejected in a streaming plan
val transformed1 = input.withColumn("id", monotonically_increasing_id())
transformed1.writeStream
  .format("parquet")
  .option("path", "/tmp/out2")
  .option("checkpointLocation", "/tmp/checkpoint_path")
  .outputMode("append")
  .start()

import org.apache.spark.sql.expressions.Window

// Fails as well: a ranking function over a non-time-based window
val windowSpecRowNum = Window.partitionBy("foo").orderBy("foo")

val transformed2 = input.withColumn("row_num", row_number.over(windowSpecRowNum))
transformed2.writeStream
  .format("parquet")
  .option("path", "/tmp/out2")
  .option("checkpointLocation", "/tmp/checkpoint_path")
  .outputMode("append")
  .start()

I also tried to add indexing with a Window over a column in the DF (transformed2 in the snippet above). It also failed, but with a different error:

"Non-time-based windows are not supported on streaming DataFrames/Datasets" “流数据帧/数据集不支持非基于时间的 windows”

You can find all the unsupported-operator checks for Spark Streaming here; it seems the traditional ways of adding an index column in Spark batch jobs simply don't work in Spark Streaming.
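
One possible workaround (a sketch I am adding, not one of the approaches tried above): since Spark 2.4 you can use foreachBatch, which hands you every micro-batch as a plain batch DataFrame, where monotonically_increasing_id() is allowed again. The ids restart in each micro-batch, so they are only unique per batch; the batchId is kept alongside to tell batches apart:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Hypothetical per-batch writer: inside foreachBatch the micro-batch is a
// static DataFrame, so batch-only expressions work. Ids restart per batch.
def writeBatch(batchDF: DataFrame, batchId: Long): Unit = {
  batchDF
    .withColumn("batch_id", lit(batchId))
    .withColumn("idx", monotonically_increasing_id())
    .write
    .mode("append")
    .parquet("/tmp/out2")
}

val query = input.writeStream
  .outputMode("append")
  .option("checkpointLocation", "/tmp/checkpoint_path")
  .foreachBatch(writeBatch _)
  .start()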
