
How to optimize the number of executor instances in a Spark Structured Streaming app?

Runtime

YARN cluster mode

Application

  • Spark Structured Streaming
  • Reads data from a Kafka topic

About Kafka topic

  • 1 topic with 4 partitions, for now (the number of partitions can be changed)
  • At most 2,000 records per second are written to the topic.

I've found out that the number of Kafka topic partitions should match the number of Spark executors (1:1).
So, from what I know so far, 4 Spark executors seems to be the answer.
But I'm worried about data throughput: can 2,000 records/sec be sustained?

Is there any guidance or recommendation for setting a proper configuration in Spark Structured Streaming?
In particular spark.executor.cores, spark.executor.instances, or other executor-related settings.

Setting spark.executor.cores to 5 or less is usually considered optimal for HDFS I/O throughput. You can read more about it here (or search for other articles): https://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
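
For example, under that rule of thumb, a layout with 4 cores per executor could be declared like this. This is a minimal sketch with illustrative values (app name, memory size, and instance count are assumptions, not recommendations); on YARN these settings are usually passed to spark-submit instead, and they only take effect if set before the SparkContext is created:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative values only -- tune them for your cluster.
// Cores per executor are kept at or below 5, per the rule of
// thumb from the Cloudera article cited above.
val spark = SparkSession.builder()
  .appName("structured-streaming-app") // hypothetical app name
  .config("spark.executor.instances", "1")
  .config("spark.executor.cores", "4")
  .config("spark.executor.memory", "4g")
  .getOrCreate()
```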

Each Kafka partition is matched to a Spark core, not an executor (one Spark core can read multiple Kafka partitions, but each Kafka partition is read by exactly one core).
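
As a sketch of what that looks like in code (the broker address and topic name are hypothetical), the Structured Streaming Kafka source below gets one task per topic partition by default:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// By default the Kafka source creates one Spark task per Kafka
// topic partition, so a 4-partition topic is read by 4 tasks,
// each occupying one core while it runs.
val kafkaStream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092") // hypothetical broker
  .option("subscribe", "my-topic")                   // hypothetical topic name
  // .option("minPartitions", "8") // Spark 2.4+: fan partitions out over more tasks
  .load()
```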

Deciding on the exact numbers you need depends on many other things, such as your application flow (e.g. if you are not doing any shuffle, the total number of cores should exactly equal the number of Kafka partitions), memory capacity and requirements, and so on.
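
For instance, with the 4-partition topic from the question and no shuffle, any executor layout whose total core count equals 4 would fit that guideline. A toy sizing check, assuming the no-shuffle case:

```scala
// Total cores (instances * cores per executor) should equal the
// number of Kafka partitions when there is no shuffle.
val kafkaPartitions  = 4
val coresPerExecutor = 2 // e.g. 2 executors x 2 cores, or 1 x 4
val executorInstances = kafkaPartitions / coresPerExecutor
assert(executorInstances * coresPerExecutor == kafkaPartitions)
```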

You can experiment with the configuration and use Spark metrics to decide whether your application is handling the throughput.
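
One way to watch those metrics from inside the application is a StreamingQueryListener, which reports per-micro-batch input and processing rates. A minimal sketch (the println sink is just for illustration; in practice you would forward these to your metrics system):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

val spark = SparkSession.builder().getOrCreate()

// Logs per-micro-batch rates; if processedRowsPerSecond stays
// below inputRowsPerSecond, the job is falling behind.
spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(e: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(e: QueryTerminatedEvent): Unit = ()
  override def onQueryProgress(e: QueryProgressEvent): Unit = {
    val p = e.progress
    println(s"input ${p.inputRowsPerSecond} rows/s, processed ${p.processedRowsPerSecond} rows/s")
  }
})
```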
