
How to append an index column to a Spark data frame using Spark Streaming in Scala?

I am using something like this:

df.withColumn("idx", monotonically_increasing_id())

But I get an exception, as it is NOT SUPPORTED:

Exception in thread "main" org.apache.spark.sql.AnalysisException: Expression(s): monotonically_increasing_id() is not supported with streaming DataFrames/Datasets;;

at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForStreaming(UnsupportedOperationChecker.scala:143)
at org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:250)
at org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:316)

Any ideas how to add an index or row number column to a Spark Streaming DataFrame in Scala?

Full stack trace: https://justpaste.it/5bdqr

There are a few operations that cannot exist anywhere in a streaming plan of Spark Streaming, unfortunately including monotonically_increasing_id(). To double-check this fact, transformed1 below fails with exactly the error from your question; the check is performed in UnsupportedOperationChecker in the Spark source code:

import org.apache.spark.sql.functions._
import spark.implicits._

// Build a small batch DataFrame, write it out, and read it back as a stream
val df = Seq(("one", 1), ("two", 2)).toDF("foo", "bar")
val schema = df.schema
df.write.parquet("/tmp/out")

val input = spark.readStream.format("parquet").schema(schema).load("/tmp/out")

// Fails: monotonically_increasing_id() is rejected in a streaming plan
val transformed1 = input.withColumn("id", monotonically_increasing_id())
transformed1.writeStream
  .format("parquet")
  .option("path", "/tmp/out2")
  .option("checkpointLocation", "/tmp/checkpoint_path")
  .outputMode("append")
  .start()

import org.apache.spark.sql.expressions.Window

// Fails as well: a ranking function over a non-time-based window
val windowSpecRowNum = Window.partitionBy("foo").orderBy("foo")

val transformed2 = input.withColumn("row_num", row_number.over(windowSpecRowNum))
transformed2.writeStream
  .format("parquet")
  .option("path", "/tmp/out2")
  .option("checkpointLocation", "/tmp/checkpoint_path")
  .outputMode("append")
  .start()

I also tried to add indexing with a Window over a column in the DF (transformed2 in the snippet above). It also failed, but with a different error:

"Non-time-based windows are not supported on streaming DataFrames/Datasets" “流数据帧/数据集不支持非基于时间的 windows”

You can find all the unsupported-operator checks for Spark Streaming here; it seems the traditional ways of adding an index column in Spark batch jobs simply don't work in Spark Streaming.
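
One possible workaround (a sketch I am adding, not one of the approaches tried above): since Spark 2.4 you can use foreachBatch, which hands you every micro-batch as a plain batch DataFrame, where monotonically_increasing_id() is allowed again. The ids restart in each micro-batch, so they are only unique per batch; the batchId is kept alongside to tell batches apart:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Hypothetical per-batch writer: inside foreachBatch the micro-batch is a
// static DataFrame, so batch-only expressions work. Ids restart per batch.
def writeBatch(batchDF: DataFrame, batchId: Long): Unit = {
  batchDF
    .withColumn("batch_id", lit(batchId))
    .withColumn("idx", monotonically_increasing_id())
    .write
    .mode("append")
    .parquet("/tmp/out2")
}

val query = input.writeStream
  .outputMode("append")
  .option("checkpointLocation", "/tmp/checkpoint_path")
  .foreachBatch(writeBatch _)
  .start()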
