How to run a map transformation in a Structured Streaming job in PySpark
I am trying to set up a Structured Streaming job with a map() transformation that makes REST API calls. Here are the details:
(1)
df=spark.readStream.format('delta') \
.option("maxFilesPerTrigger", 1000) \
.load(f'{file_location}')
(2)
respData=df.select("resource", "payload").rdd.map(lambda row: put_resource(row[0], row[1])).collect()
respDf=spark.createDataFrame(respData, ["resource", "status_code", "reason"])
(3)
respDf.writeStream \
.trigger(once=True) \
.outputMode("append") \
.format("delta") \
.option("path", f'{file_location}/Response') \
.option("checkpointLocation", f'{file_location}/Response/Checkpoints') \
.start()
However, step (2) fails with the error: Queries with streaming sources must be executed with writeStream.start().
Any help will be appreciated. Thank you.
You have to start the stream on df itself, meaning df.writeStream.start(). Step (2) calls .rdd and .collect() on a streaming DataFrame; those are batch operations, and Spark rejects them on a streaming source with exactly this error.
There is a similar thread here:
Queries with streaming sources must be executed with writeStream.start();
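One way to keep the map() logic while still starting the query from df is foreachBatch, which hands each micro-batch to a function as an ordinary (non-streaming) DataFrame. The following is only a sketch under some assumptions: put_resource here is a hypothetical stub standing in for the REST helper from the question (assumed to return a (resource, status_code, reason) tuple, as the column list in step (2) suggests), and make_batch_writer, start_response_stream and RESPONSE_SCHEMA are names made up for illustration.

```python
# Assumed column types for the response table; adjust to the real data.
RESPONSE_SCHEMA = "resource string, status_code int, reason string"


def put_resource(resource, payload):
    """Hypothetical stub for the question's REST helper.

    Assumed to return (resource, status_code, reason); replace with the real call.
    """
    return resource, 200, "OK"


def make_batch_writer(response_path):
    """Build the function that foreachBatch runs for every micro-batch."""

    def process_batch(batch_df, batch_id):
        # batch_df is a static DataFrame here, so .rdd.map() is allowed.
        responses = (
            batch_df.select("resource", "payload")
            .rdd.map(lambda row: put_resource(row[0], row[1]))
        )
        # An explicit schema avoids inference problems on empty batches.
        resp_df = responses.toDF(RESPONSE_SCHEMA)
        resp_df.write.format("delta").mode("append").save(response_path)

    return process_batch


def start_response_stream(spark, file_location):
    """Read the Delta source as a stream and process it batch by batch."""
    df = (
        spark.readStream.format("delta")
        .option("maxFilesPerTrigger", 1000)
        .load(file_location)
    )
    return (
        df.writeStream
        .trigger(once=True)
        .foreachBatch(make_batch_writer(f"{file_location}/Response"))
        .option("checkpointLocation", f"{file_location}/Response/Checkpoints")
        .start()
    )
```

Calling start_response_stream(spark, file_location) would replace steps (1) to (3). Note that foreachBatch gives at-least-once guarantees, so a retried batch may repeat its REST calls; that matters if the API being called is not idempotent.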