YARN cluster mode
I've found that the number of Kafka topic partitions should match the number of Spark executors (1:1).
So, from what I know so far, 4 Spark executors seems to be the solution.
But I'm worried about data throughput: can 2000 rec/sec be sustained?
Is there any guidance or recommendation on setting the proper configuration in Spark Structured Streaming, especially spark.executor.cores, spark.executor.instances, or other executor settings?
Setting spark.executor.cores to 5 or less is usually considered optimal for HDFS I/O throughput. You can read more about it here (or search for other articles): https://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
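As a rough sketch (not a definitive recommendation), assuming the 4-partition topic from the question and YARN cluster mode, a submission with these settings might look like the following; the memory value is an assumption you'd tune for your workload:

```shell
# Sketch: 2 executors x 2 cores = 4 total cores, one per Kafka partition.
# spark.executor.memory is a placeholder; size it for your actual state/batch.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.executor.instances=2 \
  --conf spark.executor.cores=2 \
  --conf spark.executor.memory=4g \
  your_streaming_app.py
```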
Each Kafka partition is matched to a Spark core, not an executor (one Spark core can consume multiple Kafka partitions, but each Kafka partition will be read by exactly one core).
Deciding on the exact numbers you need depends on many other things, like your application flow (e.g. if you are not doing any shuffle, the total number of cores should exactly match your number of Kafka partitions), memory capacity and requirements, etc.
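The sizing arithmetic implied above can be sketched in a few lines; the helper name is mine, and this only covers the no-shuffle case where total cores should cover the partition count:

```python
import math

def executors_needed(kafka_partitions: int, cores_per_executor: int) -> int:
    """Round up so total cores (instances * cores) cover all Kafka partitions.

    Illustrative helper for the no-shuffle case; not a Spark API.
    """
    return math.ceil(kafka_partitions / cores_per_executor)

# 4 partitions with 2 cores per executor -> 2 executors (2 x 2 = 4 cores)
print(executors_needed(4, 2))
# 4 partitions with 5 cores per executor -> 1 executor (one core stays idle)
print(executors_needed(4, 5))
```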
You can experiment with the configurations and use Spark metrics to decide whether your application is keeping up with the throughput.
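For example, Spark exposes per-trigger metrics such as inputRowsPerSecond and processedRowsPerSecond in each StreamingQueryProgress report (available as query.lastProgress on a running StreamingQuery). A minimal, pure-Python sketch of checking those against the 2000 rec/sec target from the question, using a hand-written sample dict rather than a live query:

```python
# Sketch: check a StreamingQueryProgress-shaped report against the target rate.
# In a real job you would read query.lastProgress from a running StreamingQuery;
# the dict below is a hand-written sample for illustration only.
TARGET_RPS = 2000.0

def keeps_up(progress: dict, target: float = TARGET_RPS) -> bool:
    # processedRowsPerSecond reports how fast the trigger was actually processed;
    # if it stays below the incoming rate, the query is falling behind.
    return progress.get("processedRowsPerSecond", 0.0) >= target

sample = {"inputRowsPerSecond": 2000.0, "processedRowsPerSecond": 2350.5}
print(keeps_up(sample))  # True: processing rate exceeds the target
```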