If I wanted to count how many people actively work at "Coca-Cola", I'd use the following query:
people.filter(_.company == "Coca-Cola").groupByKey(_.company).count().writeStream...
This works fine in batch mode. However, assuming the company field for a person changes over time, or assuming people get removed from the Dataset entirely, how could I get this working with Structured Streaming, so the count remains correct?
AFAIK Structured Streaming assumes the data source is append-only: does that mean I need to track deletions and updates as separate data sources, and merge them myself?
In general, the model of Structured Streaming is that you are reading from an ever-growing, append-only table. You are correct that, in order to answer your question, you will have to model a change to a value as a deletion (possibly represented by a negative value in a field like numEmployees) followed by an insertion.
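As a rough illustration of that idea, here is a minimal sketch in plain Scala collections (no Spark dependency). The `EmploymentChange` type, its `delta` field, and the helper names are all hypothetical and not part of any Spark API; the point is only that a change of company becomes two appended rows, and summing the signed values per company yields the current count.

```scala
// Hypothetical change event appended to the log.
// delta is 1 when a person joins a company and -1 when they leave it.
final case class EmploymentChange(personId: Long, company: String, delta: Int)

object SignedDeltaCount {
  // A company change is a "deletion" from the old company
  // followed by an "insertion" into the new one.
  def moveCompany(personId: Long, from: String, to: String): Seq[EmploymentChange] =
    Seq(EmploymentChange(personId, from, -1), EmploymentChange(personId, to, 1))

  // Removing a person entirely is just the "deletion" half.
  def removePerson(personId: Long, company: String): Seq[EmploymentChange] =
    Seq(EmploymentChange(personId, company, -1))

  // Summing the deltas per company over the append-only log
  // gives the current headcount for each company.
  def headcounts(log: Seq[EmploymentChange]): Map[String, Int] =
    log.groupBy(_.company).map { case (company, events) =>
      company -> events.map(_.delta).sum
    }
}
```

In Structured Streaming, the same shape would presumably be a streaming Dataset of such events aggregated with something like `groupBy("company").agg(sum("delta"))` and written with an output mode that allows aggregate results to be updated (for example `complete` or `update`), instead of `count()` over the raw people records.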