
How to make sure that the CSV write is complete?

I'm writing a dataset to CSV as follows:

df.coalesce(1)
  .write()
  .format("csv")
  .option("header", "true")
  .mode(SaveMode.Overwrite)
  .save(sink);

sparkSession.streams().awaitAnyTermination();

How do I make sure that, when the streaming job gets terminated, the output is done properly?

I have the problem that the sink folder gets overwritten and is empty if I terminate too early/late.

Additional Info: Particularly if the topic has no messages, my spark job is still running and overwrites the result with an empty file.

How do I make sure that, when the streaming job gets terminated, the output is done properly?

The way Spark Structured Streaming works is that the streaming query (job) runs continuously, and whether "the output is done properly" when it gets terminated depends on how it was terminated.

The question I'd ask is how the streaming query got terminated. Is it by StreamingQuery.stop or perhaps Ctrl-C / kill -9?

If a streaming query is terminated forcefully (Ctrl-C / kill -9), well, you get what you asked for: a partial execution, with no way to be sure the output is correct since the process (the streaming query) was shut down forcefully.

With StreamingQuery.stop the streaming query terminates gracefully and writes out everything it has processed up to that point.
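For illustration, here is a minimal sketch of a streaming CSV write that is stopped gracefully from within the application; the Kafka source, paths, trigger interval and query name are assumptions, not taken from the question:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("csv-sink").getOrCreate()

// Assumed source: a Kafka topic (the question only shows the write side).
val query = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "my-topic")
  .load()
  .selectExpr("CAST(value AS STRING) AS value")
  .writeStream
  .format("csv")
  .option("header", "true")
  .option("path", "/tmp/sink")                // assumed output location
  .option("checkpointLocation", "/tmp/chk")   // required by the file sink
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .queryName("QueryName")
  .start()

// ... later, e.g. from a shutdown hook or once some condition is met:
query.stop()  // graceful stop, unlike Ctrl-C / kill -9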

I have the problem that the sink folder gets overwritten and that the folder is empty if I terminate too early/late.

If you terminate too early/late, what else would you expect? The streaming query could not finish its work. Stop it gracefully and you will get the expected output.

Additional Info: Particularly if the topic has no messages, my spark job is still running and overwrites the result with an empty file.

That's an interesting observation which requires further exploration.

If there are no messages to be processed, no batch is triggered, so no jobs run and nothing should "overwrite the result with an empty file" (no task gets executed). You can verify this by inspecting the query's progress, as sketched below.
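One way to check whether any micro-batch actually ran is to look at the query's progress; a small sketch, assuming the handle returned by start() is available as query:

// lastProgress may be null before the first batch; otherwise it reports
// how many input rows the most recent micro-batch read.
Option(query.lastProgress).foreach { p =>
  println(s"batch=${p.batchId} numInputRows=${p.numInputRows}")
}

// recentProgress keeps a short history of progress updates as JSON.
query.recentProgress.foreach(p => println(p.json))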

Firstly, I see that you have not used writeStream, so I am not quite sure how your job is a streaming job. Now, answering your first question: you can use a StreamingQueryListener to monitor the streaming query's progress. Have another StreamingQuery read from the output location and monitor it as well. Once you have the files in the output location, use the query name and input record count from the StreamingQueryListener to gracefully stop either query (a sketch of the stop itself follows the listener code below); awaitAnyTermination should then stop your Spark application. The following code can be of help.

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

// Running total of input rows seen by the monitored query
var recordsReadCount: Long = 0L

spark.streams.addListener(new StreamingQueryListener() {
  override def onQueryStarted(event: QueryStartedEvent): Unit = {
    // logger message to show that the query has started
  }
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    synchronized {
      if (event.progress.name != null && event.progress.name.equalsIgnoreCase("QueryName")) {
        recordsReadCount = recordsReadCount + event.progress.numInputRows
        // logger messages to show continuous progress
      }
    }
  }
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = {
    synchronized {
      // logger message to show the reason of termination
    }
  }
})
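To complete the picture, here is a minimal sketch of the stop itself; the query name "QueryName" and the expected record count are assumptions for illustration, not values from the question:

val expectedRecordCount = 1000L  // assumed threshold for this sketch

// Once the listener has counted enough input records, stop the named query
// gracefully; awaitAnyTermination() then returns and the application can exit.
if (recordsReadCount >= expectedRecordCount) {
  spark.streams.active
    .filter(q => q.name != null && q.name.equalsIgnoreCase("QueryName"))
    .foreach(_.stop())
}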

Answering your second question: I, too, do not think that this is possible, as mentioned in the answer by Jacek.
