
Structured streaming custom deduplication

I have streaming data coming in from Kafka into a DataFrame. I want to remove duplicates based on Id and keep only the latest record for each Id based on timestamp.

Sample data is like this:

Id  Name    count   timestamp
1   Vikas   20      2018-09-19T10:10:10
2   Vijay   50      2018-09-19T10:10:20
3   Vilas   30      2018-09-19T10:10:30
4   Vishal  10      2018-09-19T10:10:40
1   Vikas   50      2018-09-19T10:10:50
4   Vishal  40      2018-09-19T10:11:00
1   Vikas   10      2018-09-19T10:11:10
3   Vilas   20      2018-09-19T10:11:20

The output that I am expecting would be:

Id  Name    count   timestamp
1   Vikas   10      2018-09-19T10:11:10
2   Vijay   50      2018-09-19T10:10:20
3   Vilas   20      2018-09-19T10:11:20
4   Vishal  40      2018-09-19T10:11:00

Older duplicates are removed and only the most recent record for each Id is kept, based on the timestamp field.

I am using watermarking on the timestamp field. I have tried using "df.dropDuplicates", but it keeps the older records intact and anything newer gets discarded.

Current code is as follows:

df = df.withWatermark("timestamp", "1 Day").dropDuplicates("Id", "timestamp")

How can we implement a custom dedup method so that we can keep the latest record as the unique record?

Any help is appreciated.

Sort on the timestamp column in descending order before dropping the duplicates, and deduplicate on Id alone so that the latest record per Id is the one kept:

df.withWatermark("timestamp", "1 Day")
  .sort($"timestamp".desc)
  .dropDuplicates("Id")  // dedup on Id alone; after the descending sort, the latest row per Id survives
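
Note that sort is not supported on a streaming DataFrame unless the query aggregates in complete output mode, so the snippet above only behaves as intended on batch data. A streaming-friendly sketch (assuming Scala, a SparkSession named spark, and the column names from the sample data above) keeps the latest row per Id with an aggregation instead:

import org.apache.spark.sql.functions.{max, struct}
import spark.implicits._

val latestPerId = df
  .withWatermark("timestamp", "1 day")
  .groupBy($"Id")
  // Structs compare field by field, so putting timestamp first
  // makes max() pick the most recent row for each Id.
  .agg(max(struct($"timestamp", $"Name", $"count")).as("latest"))
  .select($"Id", $"latest.Name", $"latest.count", $"latest.timestamp")

Because this introduces a streaming aggregation, run the query in update output mode.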
