
Structured streaming custom deduplication

I have streaming data coming in from Kafka into a DataFrame. I want to remove duplicates based on Id and keep only the latest record for each Id based on timestamp.

Sample data is like this:

Id  Name    count   timestamp
1   Vikas   20      2018-09-19T10:10:10
2   Vijay   50      2018-09-19T10:10:20
3   Vilas   30      2018-09-19T10:10:30
4   Vishal  10      2018-09-19T10:10:40
1   Vikas   50      2018-09-19T10:10:50
4   Vishal  40      2018-09-19T10:11:00
1   Vikas   10      2018-09-19T10:11:10
3   Vilas   20      2018-09-19T10:11:20

The output that I am expecting would be:

Id  Name    count   timestamp
1   Vikas   10      2018-09-19T10:11:10
2   Vijay   50      2018-09-19T10:10:20
3   Vilas   20      2018-09-19T10:11:20
4   Vishal  40      2018-09-19T10:11:00

Older duplicates are removed and only the most recent record for each Id is kept, based on the timestamp field.

I am using watermarking on the timestamp field. I have tried using "df.dropDuplicates", but it keeps the older records intact and anything newer gets discarded.

Current code is as follows:

df = df.withWatermark("timestamp", "1 Day").dropDuplicates("Id", "timestamp")

How can we implement a custom dedup method so that we can keep the latest record as the unique record?

Any help is appreciated.

Sort on the timestamp column in descending order before dropping the duplicates, and deduplicate on Id alone so that the latest record per Id is the one kept:

df.withWatermark("timestamp", "1 Day")
  .sort($"timestamp".desc)
  .dropDuplicates("Id")  // dedup on Id alone; after the descending sort, the latest row per Id survives
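
Note that sort is not supported on a streaming DataFrame unless the query aggregates in complete output mode, so the snippet above only behaves as intended on batch data. A streaming-friendly sketch (assuming Scala, a SparkSession named spark, and the column names from the sample data above) keeps the latest row per Id with an aggregation instead:

import org.apache.spark.sql.functions.{max, struct}
import spark.implicits._

val latestPerId = df
  .withWatermark("timestamp", "1 day")
  .groupBy($"Id")
  // Structs compare field by field, so putting timestamp first
  // makes max() pick the most recent row for each Id.
  .agg(max(struct($"timestamp", $"Name", $"count")).as("latest"))
  .select($"Id", $"latest.Name", $"latest.count", $"latest.timestamp")

Because this introduces a streaming aggregation, run the query in update output mode.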
