
Spark - combine filter results from all executors

I have 3 executors in my Spark Streaming job, which consumes from Kafka; the executor count depends on the partition count of the topic. When a message is consumed from this topic, I start a query on Hazelcast. Every executor runs the same filtering operation on Hazelcast and returns duplicated results, because the data's status is not yet updated when one executor returns it, so another executor finds the same data.

My question is: is there a way to combine the results found by all executors during streaming into a single list?

Spark executors are distributed across the cluster, so deduplicating data across the cluster is difficult. You have the following options:

  1. Use accumulators. The problem here is that accumulator values are not consistent while the job is running, so you may end up reading stale data (see the sketch after this list).
  2. Offload this work to an external system: store your output in external storage that can deduplicate it (HBase, for example). The efficiency of that storage system becomes the key factor here.
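A minimal sketch of option 1, assuming the filter results are plain strings: a `CollectionAccumulator` gathers per-executor results on the driver, which directly gives you "one list" to deduplicate. The input RDD and the string values here are stand-ins for the real Hazelcast lookups.

```scala
import org.apache.spark.sql.SparkSession
import scala.collection.JavaConverters._

object CombineExecutorResults {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("combine-executor-results-sketch")
      .getOrCreate()
    val sc = spark.sparkContext

    // Driver-side accumulator that collects values added on executors.
    val matches = sc.collectionAccumulator[String]("hazelcastMatches")

    // Stand-in for records flowing through the streaming job; in the
    // real job each executor would query Hazelcast per record instead.
    sc.parallelize(Seq("a", "b", "a", "c"), numSlices = 3)
      .foreach(record => matches.add(record))

    // Only read the accumulator on the driver, after the action has
    // completed; mid-job reads can be partial or stale. Deduplicate
    // here, since several executors may have added the same value.
    val combined: Set[String] = matches.value.asScala.toSet
    println(combined)

    spark.stop()
  }
}
```

In a streaming job you would do the same inside `foreachRDD` and read the accumulator after each batch; the stale-read caveat from point 1 still applies while a batch is in flight.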

I hope this helps

To avoid reading duplicate data, you need to maintain the offsets somewhere, preferably in HBase. Every time you consume data from Kafka, read the stored offsets from HBase, check which offsets have already been consumed for each topic, and then start reading and writing from there. After each successful write, update the stored offsets.
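A minimal sketch of that flow with the spark-streaming-kafka-0-10 direct stream. `readOffsetsFromHBase` and `writeOffsetsToHBase` are hypothetical helpers you would implement against your own HBase table (they are not library calls), and `my-topic` stands in for your real topic name.

```scala
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

object OffsetTrackingSketch {
  // Hypothetical helpers: persist offsets in an HBase table keyed by
  // topic and partition. Their implementations are left out here.
  def readOffsetsFromHBase(topic: String): Map[TopicPartition, Long] = ???
  def writeOffsetsToHBase(ranges: Array[OffsetRange]): Unit = ???

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("offset-tracking-sketch")
    val ssc = new StreamingContext(conf, Seconds(5))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "offset-tracking-sketch",
      // Offsets are committed by us, so disable Kafka's auto-commit.
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    // Resume from the last offsets recorded in HBase.
    val fromOffsets = readOffsetsFromHBase("my-topic")
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](
        Seq("my-topic"), kafkaParams, fromOffsets)
    )

    stream.foreachRDD { rdd =>
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... process the batch and write results idempotently ...
      // Persist the new offsets only after the write succeeds, so a
      // failed batch is re-read instead of skipped.
      writeOffsetsToHBase(ranges)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```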

Do you think that solves the issue?
