How to append an index column to a Spark DataFrame using Spark Streaming in Scala?
I am using something like this:
df.withColumn("idx", monotonically_increasing_id())
But I get an exception because it is NOT SUPPORTED:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Expression(s): monotonically_increasing_id() is not supported with streaming DataFrames/Datasets;;
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForStreaming(UnsupportedOperationChecker.scala:143)
at org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:250)
at org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:316)
Any ideas how to add an index or row-number column to a Spark Streaming DataFrame in Scala?
Full stacktrace: https://justpaste.it/5bdqr
There are a few operations that cannot exist anywhere in a streaming plan of Spark Streaming, and unfortunately monotonically_increasing_id() is one of them. To double-check this fact, transformed1 below fails with the same error as in your question; here is a reference on this check in the Spark source code:
import org.apache.spark.sql.functions._

// Prepare some batch data as a streaming source
val df = Seq(("one", 1), ("two", 2)).toDF("foo", "bar")
val schema = df.schema
df.write.parquet("/tmp/out")

val input = spark.readStream.format("parquet").schema(schema).load("/tmp/out")

// Fails: monotonically_increasing_id() is rejected in a streaming plan
val transformed1 = input.withColumn("id", monotonically_increasing_id())
transformed1.writeStream
  .format("parquet")
  .option("path", "/tmp/out2")
  .option("checkpointLocation", "/tmp/checkpoint_path")
  .outputMode("append")
  .start()
import org.apache.spark.sql.expressions.Window

// Also fails: non-time-based windows are not supported on streaming Datasets
val windowSpecRowNum = Window.partitionBy("foo").orderBy("foo")
val transformed2 = input.withColumn("row_num", row_number.over(windowSpecRowNum))
transformed2.writeStream
  .format("parquet")
  .option("path", "/tmp/out2")
  .option("checkpointLocation", "/tmp/checkpoint_path")
  .outputMode("append")
  .start()
I also tried adding an index with a Window over a column in the DataFrame (transformed2 in the snippet above). It failed as well, but with a different error:

"Non-time-based windows are not supported on streaming DataFrames/Datasets"
You can find all the unsupported-operator checks for Spark Streaming here; it seems the traditional ways of adding an index column in Spark batch jobs don't work in Spark Streaming.
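As a hedged sketch that goes beyond the answer above: in Spark 2.4+, one common way around these restrictions is foreachBatch, which hands each micro-batch to your function as a plain batch DataFrame, where batch-only expressions like monotonically_increasing_id() are allowed again. The paths and the input stream below reuse the names from the snippets above and are illustrative only; note that the generated idx is only unique within a micro-batch, so you would typically combine it with batchId if you need a globally distinguishable value.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Sketch: index rows per micro-batch inside foreachBatch.
// `input` is the streaming DataFrame from the snippets above.
input.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    val indexed = batchDF
      // record which micro-batch the row came from
      .withColumn("batch_id", lit(batchId))
      // allowed here: batchDF is a regular (non-streaming) DataFrame
      .withColumn("idx", monotonically_increasing_id())
    indexed.write.mode("append").parquet("/tmp/out2")
  }
  .option("checkpointLocation", "/tmp/checkpoint_path")
  .start()
```

Whether (batch_id, idx) is an acceptable substitute for a single monotonically increasing column depends on your use case; there is no built-in way to get a globally ordered index across micro-batches without maintaining your own offset state.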