How can I process deleted (or updated) rows in Spark Structured Streaming?
If I wanted to count how many people actively work at "Coca-Cola", I'd use the following query:
people.filter(_.company == "Coca-Cola").groupByKey(_.company).count().writeStream...
This works fine in batch mode.
However, assuming the company field for a person changes over time, or assuming people get removed from the Dataset entirely, how could I get this working with Structured Streaming, so the count remains correct?
AFAIK Structured Streaming assumes the data source is append-only: does that mean I need to track deletions and updates as separate data sources, and merge them myself?
In general, the model of Structured Streaming is that you are reading from an ever-growing, append-only table. You are correct that this means that in order to answer your question you will have to model changing a value as a deletion (possibly with a negative value in a field like numEmployees) followed by an insertion.
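To make the deletion-plus-insertion idea concrete, here is a minimal sketch in plain Python (not the Spark API; the event tuples and the running_counts helper are hypothetical, for illustration only). Every change arrives as an appended, signed delta, and an update becomes a retraction of the old value followed by a re-insertion of the new one:

```python
from collections import defaultdict

def running_counts(events):
    """Fold an append-only stream of (company, delta) events into
    per-company counts. delta is +1 for an insertion, -1 for a deletion."""
    counts = defaultdict(int)
    for company, delta in events:
        counts[company] += delta
    return dict(counts)

# Alice joins Coca-Cola, Bob joins Coca-Cola, then Alice moves to Pepsi.
# The move is modeled as two appended events, never an in-place edit:
events = [
    ("Coca-Cola", +1),  # insert Alice
    ("Coca-Cola", +1),  # insert Bob
    ("Coca-Cola", -1),  # retract Alice (update, part 1)
    ("Pepsi", +1),      # re-insert Alice (update, part 2)
]

print(running_counts(events))  # → {'Coca-Cola': 1, 'Pepsi': 1}
```

With this modelling the source stays append-only, as Structured Streaming requires, yet summing the deltas per key always yields the current count.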