
What is the purpose of ForeachWriter in Spark Structured Streaming?

Can someone explain why ForeachWriter is needed in Spark Structured Streaming?

Since all the source data already arrives in the form of a DataFrame, I don't see what ForeachWriter is for.

A DataFrame is an abstract Spark concept and does not directly map onto a format that can be acted on, such as being written to the console or a database.

By creating a ForeachWriter, you take the rows of a DataFrame and define how to open() the destination system you want to write to, how to process() each row, and finally how to close() the opened resources. (The per-batch counterpart is foreachBatch.)

Using a JDBC database as an example, you would establish a database session in open() and perhaps prepare a PreparedStatement that maps to the data you want to add. In process(), you receive a value of some generic type T and do whatever you want with it, such as binding its fields to the statement. Finally, in close(), you close the database connection.
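
As a rough illustration of that open/process/close lifecycle, here is a minimal Python sketch. PySpark's foreach accepts any object with a process(row) method and optional open(partition_id, epoch_id) and close(error) methods. The names SqliteRowWriter, write_rows and the events table are hypothetical, and sqlite3 is only a stand-in for a real JDBC connection. The write_rows helper imitates the order in which Spark would call the methods, so the sketch can be tried without a cluster.

```python
import sqlite3


class SqliteRowWriter:
    """Hypothetical writer following the open/process/close contract."""

    def __init__(self, db_path):
        self.db_path = db_path  # assumed: path to a local SQLite file
        self.conn = None        # created in open(), not here

    def open(self, partition_id, epoch_id):
        # Establish the database session for this partition and epoch.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS events (id INTEGER, name TEXT)"
        )
        return True  # True means "go ahead and call process() for each row"

    def process(self, row):
        # Bind the row's fields to a parameterized statement.
        self.conn.execute(
            "INSERT INTO events (id, name) VALUES (?, ?)", (row[0], row[1])
        )

    def close(self, error):
        # Commit on success, roll back on failure, then release the connection.
        if self.conn is not None:
            if error is None:
                self.conn.commit()
            else:
                self.conn.rollback()
            self.conn.close()
            self.conn = None


def write_rows(writer, rows, partition_id=0, epoch_id=0):
    """Hypothetical helper that imitates the call order for one partition."""
    error = None
    try:
        if writer.open(partition_id, epoch_id):
            for row in rows:
                writer.process(row)
    except Exception as exc:  # hand any failure to close()
        error = exc
    writer.close(error)
```

With Spark, the same object would be handed to the query with something like streaming_df.writeStream.foreach(SqliteRowWriter("/tmp/events.db")).start(), where streaming_df and the path are placeholders.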

In the case of writing to the console, there is nothing really to open or close, but you would need to call toString on each field of the row and then print it.
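
A console writer along those lines might look like the sketch below, again in Python. ConsoleRowWriter and format_row are hypothetical names. The sketch relies on a PySpark Row behaving like a tuple, so its fields can be iterated and converted with str(). Since open and close are optional in PySpark's foreach contract, only process is defined.

```python
class ConsoleRowWriter:
    """Hypothetical writer that prints each row; nothing to open or close."""

    def process(self, row):
        # Print one line per row.
        print(self.format_row(row))

    @staticmethod
    def format_row(row):
        # Convert every field to its string form and join them.
        return ", ".join(str(field) for field in row)
```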


The use cases, I feel, are well laid out in the documentation. Basically, it says that for any system that doesn't offer a writeStream.format("x") way of writing data, you need to implement this class yourself to get data into your downstream systems.

Or, if you need to write to multiple destinations, you can cache the DataFrame before writing to both locations so that it doesn't need to be recomputed, since recomputation could result in inconsistent data between your destinations.

In Spark Structured Streaming, df.writeStream currently does not support many stores, such as JDBC and HBase. This is the primary use case for ForeachWriter: it lets you write the logic for creating connections and saving data, so that you can save streaming data to any data store. Another use case is when you want to add custom logic and not just save. For more details, refer to the documentation: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#using-foreach-and-foreachbatch

In case you are thinking about df.write(): the data in a Structured Streaming DataFrame is continually updated, so df.write is ruled out, as it works only for batch DataFrames and is not supported on streaming ones.
