
How to expire state of dropDuplicates in structured streaming to avoid OOM?

I want to count the unique accesses per day using Spark Structured Streaming, so I use the following code:

.dropDuplicates("uuid")

On the next day, the state maintained for today should be dropped, so that I get the correct unique-access count for the new day and avoid OOM. The Spark documentation suggests using dropDuplicates together with a watermark, for example:

.withWatermark("timestamp", "1 day")
.dropDuplicates("uuid", "timestamp")

but the watermark column must also be passed to dropDuplicates. In that case uuid and timestamp are used as a combined key, so only events with the same uuid and the same timestamp are deduplicated, which is not what I expected.
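To make the problem concrete, here is a plain-Java illustration (not Spark; the class and method names are mine, for illustration only) of why the combined key gives the wrong count: the same user hitting the site at two different timestamps survives (uuid, timestamp) deduplication, while uuid-only deduplication counts that user once.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class DedupKeys {
    /** Count distinct (uuid, timestamp) pairs -- the combined-key behavior. */
    public static int countByUuidAndTimestamp(List<String[]> hits) {
        Set<String> seen = new LinkedHashSet<>();
        for (String[] h : hits) seen.add(h[0] + "|" + h[1]); // combined key
        return seen.size();
    }

    /** Count distinct uuids -- the behavior we actually want per day. */
    public static int countByUuid(List<String[]> hits) {
        Set<String> seen = new LinkedHashSet<>();
        for (String[] h : hits) seen.add(h[0]); // uuid only
        return seen.size();
    }
}
```

One user with two hits yields a "unique access" count of 2 under the combined key, but 1 when deduplicating on uuid alone.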

So is there a perfect solution?

After a few days of effort, I finally figured out the solution myself.

While studying the source code of watermark and dropDuplicates, I discovered that besides an eventTime column, a watermark can also be set on a window column, so we can use the following code:

.select(
    window($"timestamp", "1 day"),
    $"timestamp",
    $"uuid"
  )
.withWatermark("window", "1 day")
.dropDuplicates("uuid", "window")

Since all events in the same day fall into the same window, this produces the same result as deduplicating on uuid alone within each day. Hope this helps someone.
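The equivalence can be sketched in plain Java (not Spark; the names here are mine, for illustration): every event's timestamp maps to the index of its 1-day tumbling window, so the pair (uuid, window) is unique exactly once per user per day.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class WindowDedup {
    static final long DAY = 24L * 60 * 60; // 1-day window, in seconds

    /** Index of the 1-day tumbling window containing the event. */
    public static long windowOf(long epochSeconds) {
        return epochSeconds / DAY;
    }

    /** Distinct (uuid, window) pairs, i.e. unique accesses per day. */
    public static int uniquePerDay(String[] uuids, long[] times) {
        Set<String> seen = new LinkedHashSet<>();
        for (int i = 0; i < uuids.length; i++) {
            seen.add(uuids[i] + "|" + windowOf(times[i])); // (uuid, window) key
        }
        return seen.size();
    }
}
```

A user who appears twice on day 0 is counted once there, and is counted again if they return on day 1, which is exactly the per-day unique-access semantics.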

Below is a modification of the procedure proposed in the Spark documentation. The trick is to manipulate the event time, i.e. to put event times into buckets. The assumption is that the event time is provided in milliseconds.

// Removes all duplicates that fall in the same 15-minute tumbling window.
// It does NOT remove duplicates that fall in different 15-minute windows!
public static Dataset<Row> removeDuplicates(Dataset<Row> df) {
    // Truncate the event time (milliseconds) into 15-minute buckets:
    // timestamp - (timestamp % (15 * 60)), computed in seconds
    Column bucketCol = functions.to_timestamp(
            col("event_time").divide(1000).minus((col("event_time").divide(1000)).mod(15 * 60)));
    df = df.withColumn("bucket", bucketCol);

    String windowDuration = "15 minutes";
    df = df.withWatermark("bucket", windowDuration)
            .dropDuplicates("uuid", "bucket");

    return df.drop("bucket");
}
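The bucket arithmetic above can be checked in isolation (a plain-Java sketch; the method name is mine, for illustration): a millisecond event time is converted to seconds and floored to the start of its 15-minute window.

```java
public class Bucket {
    /** Floor a millisecond epoch time to the start of its 15-minute bucket (in seconds). */
    public static long bucket15(long epochMillis) {
        long s = epochMillis / 1000L;  // milliseconds -> seconds
        return s - (s % (15 * 60));    // floor to the 15-minute boundary
    }
}
```

All events inside the same 15-minute window get the same bucket value, so dropDuplicates("uuid", "bucket") deduplicates per window, as the comment in the method warns.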

I found that the window function alone didn't work for me, so I used window.start (or window.end) instead.

.select(
    window($"timestamp", "1 day").start.as("window"),
    $"timestamp",
    $"uuid"
)
.withWatermark("window", "1 day")
.dropDuplicates("uuid", "window")
