
How can I process deleted (or updated) rows in Spark Structured Streaming?

If I wanted to count how many people actively work at "Coca-Cola", I'd use the following query:

people.filter(_.company == "Coca-Cola").groupByKey(_.company).count().writeStream...

This works fine in batch mode.

However, assuming a person's company field changes over time, or that people get removed from the Dataset entirely, how could I get this working with Structured Streaming so the count remains correct?

AFAIK Structured Streaming assumes the data source is append-only: does that mean I need to track deletions and updates as separate data sources, and merge them myself?

In general, the model of Structured Streaming is that you are reading from an ever-growing, append-only table. You are correct that this means that, in order to answer your question, you will have to model changing a value as a deletion (possibly with a negative value in a field like numEmployees) followed by an insertion.
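A minimal sketch of that idea, assuming a hypothetical PersonEvent schema with a delta column (+1 when a person joins a company, -1 when they leave or before being re-inserted with a new company); the JSON input path and the console sink are made up for illustration, and any append-only source would work the same way:

```scala
import org.apache.spark.sql.{Encoders, SparkSession}
import org.apache.spark.sql.functions.sum

object CompanyHeadcount {
  // Hypothetical event schema: every change is appended as a delta row.
  case class PersonEvent(personId: Long, company: String, delta: Int)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("company-headcount").getOrCreate()
    import spark.implicits._

    // Append-only stream of change events, e.g. JSON files landing in a directory.
    // A person moving from "Coca-Cola" to "Pepsi" is appended as two rows:
    //   {"personId": 1, "company": "Coca-Cola", "delta": -1}
    //   {"personId": 1, "company": "Pepsi",     "delta": 1}
    val events = spark.readStream
      .schema(Encoders.product[PersonEvent].schema)
      .json("/data/person-events")          // path is an assumption
      .as[PersonEvent]

    // Summing the deltas per company yields the current headcount; updates and
    // deletions stay correct because they are just more appended rows.
    val headcount = events
      .groupBy($"company")
      .agg(sum($"delta").as("headcount"))

    headcount.writeStream
      .outputMode("complete")               // re-emit the full totals each trigger
      .format("console")
      .start()
      .awaitTermination()
  }
}
```

With this modeling, the original filter-and-count query becomes a sum over deltas for the "Coca-Cola" group, and no separate merge of deletion/update sources is needed.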
