
SPARK: Pyspark: how to monitor python worker processes

Question
How can we monitor PySpark Python worker processes in terms of CPU and memory usage?

Details
According to this doc, one Spark worker can contain one or more Python processes.

Let's assume we have allocated 40g of memory per executor, running on a worker that has up to 200g of memory available. Then, according to the documented setting spark.python.worker.memory, we are able to set the amount of memory available per Python process.

Quoted from the spark.python.worker.memory setting description:

Amount of memory to use per python worker process during aggregation, in the same format as JVM memory strings (eg 512m, 2g). If the memory used during aggregation goes above this amount, it will spill the data into disks.

Let's assume that we set spark.python.worker.memory to 2g.
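For illustration, a minimal sketch of how these two settings could be passed when building the session (the application name is made up; values match the numbers above):

    from pyspark import SparkConf
    from pyspark.sql import SparkSession

    # Hypothetical configuration: 40g of JVM heap per executor,
    # 2g aggregation buffer per Python worker.
    conf = (SparkConf()
            .set("spark.executor.memory", "40g")
            .set("spark.python.worker.memory", "2g"))

    spark = (SparkSession.builder
             .appName("worker-memory-demo")   # made-up name
             .config(conf=conf)
             .getOrCreate())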

To me the following questions arise:

  • How do we know how many processes pyspark/Spark spawns on each worker/executor?
  • How can we monitor how much memory we consume per process, and overall, to see how close we are to the 40g executor limit we set?
  • How can we monitor how much we're spilling to disk per process?
  • More generally, how can we optimize our PySpark applications using the spark.python.worker.memory setting? Is this just a question of trial and error? If so, how can we benchmark/monitor it (similar to the above)?



Why? Well, we are hitting some performance issues that are very specific to our application, and observing some inconsistent errors that we cannot reproduce. As such, we need to monitor and understand the finer details of what is happening each time our application runs.

Answer

according to the documented setting spark.python.worker.memory, we are able to set the amount of memory available per Python process.

This is not true. As explained in the documentation you've linked, this setting is used to control aggregation behavior, not Python worker memory in general.

This memory doesn't account for the size of local objects or broadcast variables, only for the temporary structures used during aggregation.

How do we know how many processes pyspark/Spark spawns on each worker/executor?

Python workers can be spawned up to the limit set by the number of available cores. Because workers can be started or killed at runtime, the actual number of workers outside peak load can be smaller.
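There is no built-in counter for this, but on an executor host you can look for the pyspark.daemon processes directly. A rough sketch using psutil (assuming psutil is installed on the node; this is ordinary OS-level inspection, not a Spark API, and the count includes the daemon itself since forked workers inherit its command line):

    import psutil

    # PySpark's Python side is typically started as `python -m pyspark.daemon`
    # on Unix, so matching on the command line finds the daemon and its workers.
    workers = []
    for p in psutil.process_iter(["pid", "cmdline", "memory_info"]):
        cmdline = p.info["cmdline"] or []
        if any("pyspark.daemon" in arg or "pyspark.worker" in arg for arg in cmdline):
            workers.append(p)

    for p in workers:
        rss_mb = p.info["memory_info"].rss / 1024 ** 2
        print(f"pid={p.info['pid']} rss={rss_mb:.0f} MiB")

    print(f"Python worker processes on this host: {len(workers)}")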

How can we monitor how much memory we consume per process, and overall, to see how close we are to the 40g executor limit we set?

There is no Spark-specific answer. You can use general monitoring tools, or the resource module from within the application itself.
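As one in-application option, you can sample the resource module from inside the tasks themselves. A hedged sketch (the probe function and the RDD are made up for illustration; ru_maxrss is reported in KiB on Linux):

    import resource
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rss-probe").getOrCreate()  # made-up name
    sc = spark.sparkContext

    def peak_rss_mib(partition):
        # Consume the partition first so the measurement reflects the work done,
        # then report this Python worker's peak resident set size.
        for _ in partition:
            pass
        usage = resource.getrusage(resource.RUSAGE_SELF)
        yield usage.ru_maxrss / 1024.0  # KiB -> MiB on Linux

    rdd = sc.parallelize(range(1_000_000), 8)
    print("peak RSS per partition's Python worker (MiB):",
          rdd.mapPartitions(peak_rss_mib).collect())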

How can we monitor how much we're spilling to disk per process?

You can use the Spark REST API to get some insight, but in general PySpark metrics are somewhat limited.
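For example, the per-stage data exposed by the REST API includes memoryBytesSpilled and diskBytesSpilled, though these are stage-level totals and won't map one-to-one onto individual Python processes. A rough sketch (assuming the driver UI is reachable on localhost:4040; adjust the host, port and application id for your cluster manager or history server):

    import requests

    base = "http://localhost:4040/api/v1"  # assumed driver UI address

    # Take the first (usually only) application served by this driver.
    app_id = requests.get(f"{base}/applications").json()[0]["id"]

    for stage in requests.get(f"{base}/applications/{app_id}/stages").json():
        print(f"stage {stage['stageId']}: "
              f"memoryBytesSpilled={stage['memoryBytesSpilled']}, "
              f"diskBytesSpilled={stage['diskBytesSpilled']}")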

