
How can I process deleted (or updated) rows in Spark Structured Streaming?

If I wanted to count how many people actively work at "Coca-Cola", I'd use the following query:

people.filter(_.company == "Coca-Cola").groupByKey(_.company).count().writeStream...

This works fine in batch mode.

However, assuming the company field for a person changes over time, or assuming people get removed from the Dataset entirely, how could I get this working with Structured Streaming, so the count remains correct?

AFAIK Structured Streaming assumes the data source is append-only: does that mean I need to track deletions and updates as separate data sources, and merge them myself?

In general, the model of Structured Streaming is that you are reading from an ever-growing, append-only table. You are correct that this means that, to answer your question, you will have to model a change to a value as a deletion (possibly with a negative value in a field like numEmployees) followed by an insertion.
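As a minimal sketch of that idea (not code from the answer): assume each change arrives as a new event row carrying a delta field, +1 when a person joins a company and -1 when they leave, so an update is emitted as a -1 for the old company followed by a +1 for the new one. The "person-events" Kafka topic and the PersonEvent fields here are hypothetical.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json, sum}
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

object HeadcountStream {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("HeadcountStream").getOrCreate()
    import spark.implicits._

    // Shape of each change event: +1 = joined the company, -1 = left it.
    val schema = new StructType()
      .add("name", StringType)
      .add("company", StringType)
      .add("delta", IntegerType)

    // Assumed source: a Kafka topic carrying JSON-encoded change events.
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "person-events")
      .load()
      .select(from_json(col("value").cast("string"), schema).as("e"))
      .select("e.*")

    // Summing the deltas per company keeps the headcount correct even when
    // people switch companies or are removed: every change is just another
    // appended row, which fits the append-only model.
    val headcount = events
      .filter($"company" === "Coca-Cola")
      .groupBy($"company")
      .agg(sum($"delta").as("activeEmployees"))

    headcount.writeStream
      .outputMode("complete") // the aggregate is re-emitted as new events arrive
      .format("console")
      .start()
      .awaitTermination()
  }
}

The key design point is that nothing is ever rewritten in place: deletions and updates are represented as additional rows, and the streaming aggregation folds them into the running count.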
