Why does Spark Streaming need a certain number of CPU cores to run correctly?

The Spark Streaming documentation notes:

it is important to remember that a Spark Streaming application needs to be allocated enough cores to process the received data, as well as to run the receiver(s)

and then:

If the number of cores allocated to the application is less than or equal to the number of input DStreams / receivers, then the system will receive data, but not be able to process them

This seems surprising, since the OS would normally schedule CPU time so that the application keeps making progress regardless of how many CPU cores are available, unless something actively prevents it from doing so. My questions are:

  • Does Spark do something special to prevent normal CPU scheduling?
  • If so, what's the rationale behind it?

I just realised that by 'core' they must mean 'thread'. If there are not enough threads, it will certainly lead to thread starvation. In line with this, I could create a local cluster with more 'cores' than the physical cores available (e.g. "local[10]" on a machine with only 4 CPU cores).
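
For illustration, here is a minimal, hypothetical sketch of such a local Spark Streaming app (the class name, host, and port are made up): the master URL "local[10]" requests 10 worker threads regardless of how many physical cores the machine has.

    // Hypothetical sketch: "local[10]" asks Spark for 10 local worker threads,
    // which it grants even on a machine with only 4 physical cores.
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object LocalThreadsDemo {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setMaster("local[10]")   // 10 threads, not 10 physical cores
          .setAppName("LocalThreadsDemo")
        val ssc = new StreamingContext(conf, Seconds(1))

        // The single socket receiver occupies one of the 10 threads for as long
        // as the application runs; the remaining threads process the batches.
        val lines = ssc.socketTextStream("localhost", 9999)
        lines.count().print()

        ssc.start()
        ssc.awaitTermination()
      }
    }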

No, it looks like the documentation is correct and means physical CPU cores, not threads. Starting six receivers on a 4-core machine causes the whole Spark Streaming application to stall, even with "local[10]". At the same time, the same app runs flawlessly on a machine with 8 cores.
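
For reference, a hedged sketch of the setup described above: six socket receivers started in a single application (hosts, ports, and the class name are assumptions, not from the original post). Each receiver runs as a long-lived task that holds on to one core/thread for the lifetime of the job, so the receiver count has to be weighed against the cores actually available.

    // Hypothetical sketch of the six-receiver setup; ports are made up.
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object SixReceiversDemo {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setMaster("local[10]")   // 10 threads requested on a 4-core machine
          .setAppName("SixReceiversDemo")
        val ssc = new StreamingContext(conf, Seconds(1))

        // Each socketTextStream call starts its own long-running receiver,
        // so this creates six receivers competing for the available cores.
        val streams = (9990 to 9995).map(port => ssc.socketTextStream("localhost", port))
        val all = ssc.union(streams)
        all.count().print()

        ssc.start()
        ssc.awaitTermination()
      }
    }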

