I'm writing a dataset to CSV as follows:
df.coalesce(1)
  .write()
  .format("csv")
  .option("header", "true")
  .mode(SaveMode.Overwrite)
  .save(sink);

sparkSession.streams().awaitAnyTermination();
How do I make sure that, when the streaming job is terminated, the output is written properly?
I have the problem that the sink folder gets overwritten and is empty if I terminate too early or too late.
Additional info: in particular, if the topic has no messages, my Spark job keeps running and overwrites the result with an empty file.
How do I make sure that, when the streaming job is terminated, the output is written properly?
The way Spark Structured Streaming works is that a streaming query (job) runs continuously until it is stopped, and how it is stopped determines whether the output is written properly.
The question I'd ask is how the streaming query got terminated. Was it via `StreamingQuery.stop`, or perhaps Ctrl-C / `kill -9`?
If a streaming query is terminated forcefully (Ctrl-C / `kill -9`), well, you get what you asked for: a partial execution, with no way to be sure the output is correct, since the process (the streaming query) was shut down forcefully.
With `StreamingQuery.stop`, the streaming query terminates gracefully and writes out everything it can at that time.
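For instance, a graceful shutdown can be sketched as follows, assuming `df` is a streaming Dataset and `sink` / `checkpointDir` are paths you define (this is a sketch under those assumptions, not the asker's exact job; note the Structured Streaming file sink requires a checkpoint location):

```scala
import org.apache.spark.sql.streaming.StreamingQuery

// Start the streaming write (assumed variables: df, sink, checkpointDir).
val query: StreamingQuery = df.writeStream
  .format("csv")
  .option("header", "true")
  .option("path", sink)
  .option("checkpointLocation", checkpointDir)
  .start()

query.processAllAvailable() // blocks until all currently available input is processed
query.stop()                // graceful stop: committed output files are complete
```

`processAllAvailable` followed by `stop` ensures the query drains what it can see before shutting down, instead of being killed mid-batch.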
I have the problem that the sink folder gets overwritten and is empty if I terminate too early or too late.
If you terminate too early or too late, what else would you expect? The streaming query could not finish its work. Stop it gracefully and you get the expected output.
Additional info: in particular, if the topic has no messages, my Spark job keeps running and overwrites the result with an empty file.
That's an interesting observation which requires further exploration.
If there are no messages to process, no micro-batch is triggered, hence no jobs and no tasks are executed, so nothing should "overwrite the result with an empty file".
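If in doubt, you can inspect the query's progress before stopping it, using the public `StreamingQuery.recentProgress` API; a sketch, assuming `query` is an already-started `StreamingQuery`:

```scala
// Sketch: stop the query only once at least one non-empty micro-batch has run,
// so a stop with no processed data does not leave an empty result behind.
val sawData = query.recentProgress.exists(_.numInputRows > 0)
if (sawData) {
  query.stop() // graceful stop after real data has been written
}
```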
Firstly, I see that you have not used `writeStream`, so I am not quite sure how your job is a streaming job. Now, to answer your first question: you can use a `StreamingQueryListener` to monitor the streaming query's progress. Have another streaming query read from the output location and monitor it as well. Once the files appear in the output location, use the query name and input record count in the `StreamingQueryListener` to gracefully `stop` any query. `awaitAnyTermination` should then let your Spark application stop. The following code may help:
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

var recordsReadCount = 0L // total input rows seen by the monitored query

spark.streams.addListener(new StreamingQueryListener() {
  override def onQueryStarted(event: QueryStartedEvent): Unit = {
    // log that the query has started
  }

  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    synchronized {
      // unnamed queries have a null name, so guard before comparing
      if (event.progress.name != null &&
          event.progress.name.equalsIgnoreCase("QueryName")) {
        recordsReadCount += event.progress.numInputRows
        // log continuous progress
      }
    }
  }

  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = {
    synchronized {
      // log the reason for termination
    }
  }
})
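Once the listener has observed the expected record count, the query can be stopped from outside by name; a sketch, where "QueryName" is a placeholder for whatever name you gave the query via `queryName(...)`:

```scala
// Gracefully stop an active query by its (assumed) name "QueryName".
spark.streams.active
  .filter(_.name == "QueryName")
  .foreach(_.stop())
```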
To answer your second question: I, too, do not think this is possible, as mentioned in Jacek's answer.