Spark Structured Streaming mapGroupsWithState output to Parquet

I have a Spark Structured Streaming application that merges events with mapGroupsWithState. It works perfectly with the console sink, but in production I need to write the data in Parquet format. I'm confused, because mapGroupsWithState requires Update mode while the Parquet sink requires Append mode. Is there a solution for this? Or can we somehow use the foreach sink in this case?

val query: Dataset[BidData] = bidStream
    .groupByKey(_.auction_id)
    .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout())(updateBidState)
    .flatMap(b => b)

query.writeStream
    .outputMode(OutputMode.Update())
    .format("parquet")
    .option("path", appConfig.s3Output)
    .option("checkpointLocation", appConfig.checkpoint)

You can use flatMapGroupsWithState with Append mode instead (mapGroupsWithState is a special case of flatMapGroupsWithState).
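As a rough sketch of that approach, the query could look something like the following. The `BidData` and `BidState` case classes, the 10-minute timeout, and the `BidMerge.start` helper are assumptions made for illustration, since the question does not show them; the point is only that `flatMapGroupsWithState` takes an explicit output mode and a function returning an `Iterator`, which allows the Parquet sink to run in Append mode.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode, StreamingQuery}

// Hypothetical shapes: the question does not show BidData or the state type.
case class BidData(auction_id: String, price: Double)
case class BidState(auction_id: String, lastPrice: Double, events: Long)

object BidMerge {
  // flatMapGroupsWithState expects the function to return an Iterator.
  def updateBidState(
      auctionId: String,
      bids: Iterator[BidData],
      state: GroupState[BidState]): Iterator[BidState] = {
    if (state.hasTimedOut) {
      // No new bids arrived before the timeout: drop the state, emit nothing.
      state.remove()
      Iterator.empty
    } else {
      val previous = state.getOption.getOrElse(BidState(auctionId, 0.0, 0L))
      // Assumed merge logic: keep the latest price and count the events seen.
      val merged = bids.foldLeft(previous) { (s, b) =>
        s.copy(lastPrice = b.price, events = s.events + 1)
      }
      state.update(merged)
      state.setTimeoutDuration("10 minutes") // assumed timeout
      // Every intermediate state is emitted, so the sink accumulates history.
      Iterator(merged)
    }
  }

  def start(
      spark: SparkSession,
      bidStream: Dataset[BidData],
      outputPath: String,
      checkpointPath: String): StreamingQuery = {
    import spark.implicits._ // encoders for the case classes above

    bidStream
      .groupByKey(_.auction_id)
      .flatMapGroupsWithState(
        OutputMode.Append(),
        GroupStateTimeout.ProcessingTimeTimeout())(updateBidState)
      .writeStream
      .outputMode(OutputMode.Append()) // the file sink accepts Append mode
      .format("parquet")
      .option("path", outputPath)
      .option("checkpointLocation", checkpointPath)
      .start()
  }
}
```

In the question's setup, `outputPath` and `checkpointPath` would correspond to `appConfig.s3Output` and `appConfig.checkpoint`.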

But if you do so, the Parquet output will contain all the historical bid states. If you need the last bid state, you will have to write queries (I assume you will use Spark SQL or Hive) that return the last bid state.
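One possible way to pick the last state per auction from the Parquet output is a window function over a batch read. This is only a sketch: the `updated_at` column is an assumption (the question does not show the output schema), so substitute whatever column orders the states of one auction, and `LatestBidState` is a hypothetical helper name.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

object LatestBidState {
  // Reads the accumulated states and keeps one row per auction_id.
  def latest(spark: SparkSession, path: String): DataFrame = {
    // Assumed ordering column: "updated_at" is not part of the original post.
    val byAuction = Window
      .partitionBy("auction_id")
      .orderBy(col("updated_at").desc)

    spark.read
      .parquet(path)
      .withColumn("rn", row_number().over(byAuction))
      .where(col("rn") === 1) // most recent state for each auction
      .drop("rn")
  }
}
```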

Reference: https://jaceklaskowski.gitbooks.io/spark-structured-streaming/spark-sql-streaming-KeyValueGroupedDataset-flatMapGroupsWithState.html
