
Spark - combine filter results from all executors

I have 3 executors in my Spark Streaming job, which consumes from Kafka; the executor count depends on the partition count of the topic. When a message is consumed from this topic, I start a query on Hazelcast. Every executor runs the same filtering operation on Hazelcast and returns duplicated results, because the data's status is not yet updated when one executor returns it, so another executor finds the same data.

My question is: is there a way to combine the results found by all executors during streaming into a single list?

Spark executors are distributed across the cluster, so deduplicating data across the cluster is difficult. You have the following options:

  1. Use accumulators. The problem here is that accumulator values are not consistent while the job is running, so you may end up reading stale data (see the sketch after this list).
  2. Offload this work to an external system: store your output in external storage that can deduplicate it (HBase, for example). The efficiency of that storage system becomes the key factor here.
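A minimal sketch of option 1, assuming the filter results are plain strings: a `CollectionAccumulator` gathers per-executor results on the driver, which directly gives you "one list" to deduplicate. The input RDD and the string values here are stand-ins for the real Hazelcast lookups.

```scala
import org.apache.spark.sql.SparkSession
import scala.collection.JavaConverters._

object CombineExecutorResults {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("combine-executor-results-sketch")
      .getOrCreate()
    val sc = spark.sparkContext

    // Driver-side accumulator that collects values added on executors.
    val matches = sc.collectionAccumulator[String]("hazelcastMatches")

    // Stand-in for records flowing through the streaming job; in the
    // real job each executor would query Hazelcast per record instead.
    sc.parallelize(Seq("a", "b", "a", "c"), numSlices = 3)
      .foreach(record => matches.add(record))

    // Only read the accumulator on the driver, after the action has
    // completed; mid-job reads can be partial or stale. Deduplicate
    // here, since several executors may have added the same value.
    val combined: Set[String] = matches.value.asScala.toSet
    println(combined)

    spark.stop()
  }
}
```

In a streaming job you would do the same inside `foreachRDD` and read the accumulator after each batch; the stale-read caveat from point 1 still applies while a batch is in flight.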

I hope this helps

To avoid reading duplicate data, you need to maintain the offsets somewhere, preferably in HBase. Every time you consume data from Kafka, read the stored offsets from HBase, check which offsets have already been consumed for each topic, and then start reading and writing from there. After each successful write, update the stored offsets.
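A minimal sketch of that flow with the spark-streaming-kafka-0-10 direct stream. `readOffsetsFromHBase` and `writeOffsetsToHBase` are hypothetical helpers you would implement against your own HBase table (they are not library calls), and `my-topic` stands in for your real topic name.

```scala
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

object OffsetTrackingSketch {
  // Hypothetical helpers: persist offsets in an HBase table keyed by
  // topic and partition. Their implementations are left out here.
  def readOffsetsFromHBase(topic: String): Map[TopicPartition, Long] = ???
  def writeOffsetsToHBase(ranges: Array[OffsetRange]): Unit = ???

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("offset-tracking-sketch")
    val ssc = new StreamingContext(conf, Seconds(5))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "offset-tracking-sketch",
      // Offsets are committed by us, so disable Kafka's auto-commit.
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    // Resume from the last offsets recorded in HBase.
    val fromOffsets = readOffsetsFromHBase("my-topic")
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](
        Seq("my-topic"), kafkaParams, fromOffsets)
    )

    stream.foreachRDD { rdd =>
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... process the batch and write results idempotently ...
      // Persist the new offsets only after the write succeeds, so a
      // failed batch is re-read instead of skipped.
      writeOffsetsToHBase(ranges)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```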

Do you think that solves the issue?
