
How to expire state of dropDuplicates in structured streaming to avoid OOM?

I want to count the unique accesses for each day using Spark Structured Streaming, so I use the following code:

.dropDuplicates("uuid")

On the next day, the state maintained for today should be dropped so that I can get the right count of unique accesses for that day and avoid OOM. The Spark documentation suggests using dropDuplicates together with a watermark, for example:

.withWatermark("timestamp", "1 day")
.dropDuplicates("uuid", "timestamp")

but the watermark column must be included in dropDuplicates. In that case, uuid and timestamp are used as a combined key, so only elements with the same uuid and the same timestamp are deduplicated, which is not what I expected.

So is there a perfect solution?

After a few days of effort, I finally figured out the way myself.

While studying the source code of watermark and dropDuplicates, I discovered that besides an eventTime column, a watermark can also be set on a window column, so we can use the following code:

.select(
    window($"timestamp", "1 day"),
    $"timestamp",
    $"uuid"
  )
.withWatermark("window", "1 day")
.dropDuplicates("uuid", "window")

Since all events in the same day have the same window, this produces the same result as deduplicating on uuid alone, and the watermark lets Spark discard the previous day's state. Hope this can help someone.
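For context, here is a minimal end-to-end sketch of how this fits into a complete query. It is only a sketch: the rate source, the renaming of value to uuid, and the console sink are placeholder assumptions, and chaining deduplication with an aggregation (two stateful operators in one query) requires a sufficiently recent Spark version.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

val spark = SparkSession.builder.appName("daily-uniques").getOrCreate()
import spark.implicits._

// Placeholder source: the rate source emits `timestamp` and `value` columns.
val events = spark.readStream
  .format("rate")
  .load()
  .withColumnRenamed("value", "uuid")

// Deduplicate per (uuid, day-window); the watermark lets state for past days expire.
val deduped = events
  .select(window($"timestamp", "1 day"), $"timestamp", $"uuid")
  .withWatermark("window", "1 day")
  .dropDuplicates("uuid", "window")

// Count the surviving (unique) accesses per day.
deduped.groupBy($"window").count()
  .writeStream
  .outputMode("append")
  .format("console")
  .start()
  .awaitTermination()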

Below is a modification of the procedure proposed in the Spark documentation. The trick is to manipulate the event time, i.e., to put event times into buckets. The assumption is that event time is provided in milliseconds.

import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.functions;

import static org.apache.spark.sql.functions.col;

// Removes all duplicates that fall into the same 15-minute tumbling window.
// It does NOT remove duplicates that land in different 15-minute windows!
public static Dataset<Row> removeDuplicates(Dataset<Row> df) {
    // Floor the event time (epoch milliseconds) to a 15-minute bucket:
    // timestamp_s - (timestamp_s % (15 * 60)), where timestamp_s = event_time / 1000
    Column bucketCol = functions.to_timestamp(
            col("event_time").divide(1000).minus((col("event_time").divide(1000)).mod(15 * 60)));
    df = df.withColumn("bucket", bucketCol);

    String windowDuration = "15 minutes";
    df = df.withWatermark("bucket", windowDuration)
            .dropDuplicates("uuid", "bucket");

    return df.drop("bucket");
}
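For readers working in Scala, the same bucketing idea can be expressed there too. This is my own sketch, assuming as above that event_time holds epoch milliseconds and that df has a uuid column:

import org.apache.spark.sql.functions.{col, to_timestamp}

// Floor the epoch-second value to the start of its 15-minute bucket.
val seconds = col("event_time") / 1000
val deduped = df
  .withColumn("bucket", to_timestamp(seconds - seconds % (15 * 60)))
  .withWatermark("bucket", "15 minutes")
  .dropDuplicates("uuid", "bucket")
  .drop("bucket")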

I found that the window function didn't work for me, so I chose to use window.start (or window.end) instead.

.select(
   window($"timestamp", "1 day")("start").as("window"),
   $"timestamp",
   $"uuid"
)
.withWatermark("window", "1 day")
.dropDuplicates("uuid", "window")
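Note that window(...)("start") yields a plain timestamp column, so it has to be aliased back to "window" (or whatever name is passed to withWatermark and dropDuplicates). Deduplicating on the window start this way is essentially the millisecond-bucketing trick from the previous answer, with Spark computing the bucket boundaries for you.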
