
SPARK: Pyspark: how to monitor python worker processes

Question
How can we monitor PySpark Python worker processes in terms of CPU and memory usage?

Details
According to this doc, one Spark worker can contain one or more Python processes.

Let's assume we have allocated 40g of memory per executor, running on a worker that has up to 200g of memory available. Then, according to the documented setting spark.python.worker.memory, we are able to set the amount of memory available per Python process.

Quoted from the spark.python.worker.memory setting description:

Amount of memory to use per python worker process during aggregation, in the same format as JVM memory strings (eg 512m, 2g). If the memory used during aggregation goes above this amount, it will spill the data into disks.

Let's assume that we set spark.python.worker.memory to 2g.
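For illustration, a minimal sketch of how these two settings could be passed when building the session (the application name is made up; values match the numbers above):

    from pyspark import SparkConf
    from pyspark.sql import SparkSession

    # Hypothetical configuration: 40g of JVM heap per executor,
    # 2g aggregation buffer per Python worker.
    conf = (SparkConf()
            .set("spark.executor.memory", "40g")
            .set("spark.python.worker.memory", "2g"))

    spark = (SparkSession.builder
             .appName("worker-memory-demo")   # made-up name
             .config(conf=conf)
             .getOrCreate())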

To me the following questions arise:

  • How do we know how many processes pyspark/Spark spawns on each worker/executor?
  • How can we monitor how much memory we consume per process, and overall, to see how close we are to the 40g executor limit we set?
  • How can we monitor how much we're spilling to disk per process?
  • More generally, how can we optimize our PySpark applications using the spark.python.worker.memory setting? Is this just a question of trial and error? If so, how can we benchmark/monitor it (similar to the above)?



Why? Well, we are hitting some performance issues that are very specific to our application, and observing some inconsistent errors that we cannot reproduce. As such, we need to monitor and understand the finer details of what is happening each time our application runs.

Answer

according to the documented setting spark.python.worker.memory, we are able to set the amount of memory available per Python process.

This is not true. As explained in the documentation you've linked, this setting is used to control aggregation behavior, not Python worker memory in general.

This memory doesn't account for the size of local objects or broadcast variables, only for the temporary structures used during aggregation.

How do we know how many processes pyspark/Spark spawns on each worker/executor?

Python workers can be spawned up to the limit set by the number of available cores. Because workers can be started or killed at runtime, the actual number of workers outside peak load can be smaller.
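There is no built-in counter for this, but on an executor host you can look for the pyspark.daemon processes directly. A rough sketch using psutil (assuming psutil is installed on the node; this is ordinary OS-level inspection, not a Spark API, and the count includes the daemon itself since forked workers inherit its command line):

    import psutil

    # PySpark's Python side is typically started as `python -m pyspark.daemon`
    # on Unix, so matching on the command line finds the daemon and its workers.
    workers = []
    for p in psutil.process_iter(["pid", "cmdline", "memory_info"]):
        cmdline = p.info["cmdline"] or []
        if any("pyspark.daemon" in arg or "pyspark.worker" in arg for arg in cmdline):
            workers.append(p)

    for p in workers:
        rss_mb = p.info["memory_info"].rss / 1024 ** 2
        print(f"pid={p.info['pid']} rss={rss_mb:.0f} MiB")

    print(f"Python worker processes on this host: {len(workers)}")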

How can we monitor how much memory we consume per process, and overall, to see how close we are to the 40g executor limit we set?

There is no Spark-specific answer. You can use general monitoring tools, or the resource module from within the application itself.
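As one in-application option, you can sample the resource module from inside the tasks themselves. A hedged sketch (the probe function and the RDD are made up for illustration; ru_maxrss is reported in KiB on Linux):

    import resource
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rss-probe").getOrCreate()  # made-up name
    sc = spark.sparkContext

    def peak_rss_mib(partition):
        # Consume the partition first so the measurement reflects the work done,
        # then report this Python worker's peak resident set size.
        for _ in partition:
            pass
        usage = resource.getrusage(resource.RUSAGE_SELF)
        yield usage.ru_maxrss / 1024.0  # KiB -> MiB on Linux

    rdd = sc.parallelize(range(1_000_000), 8)
    print("peak RSS per partition's Python worker (MiB):",
          rdd.mapPartitions(peak_rss_mib).collect())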

How can we monitor how much we're spilling to disk per process?

You can use the Spark REST API to get some insight, but in general PySpark metrics are somewhat limited.
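For example, the per-stage data exposed by the REST API includes memoryBytesSpilled and diskBytesSpilled, though these are stage-level totals and won't map one-to-one onto individual Python processes. A rough sketch (assuming the driver UI is reachable on localhost:4040; adjust the host, port and application id for your cluster manager or history server):

    import requests

    base = "http://localhost:4040/api/v1"  # assumed driver UI address

    # Take the first (usually only) application served by this driver.
    app_id = requests.get(f"{base}/applications").json()[0]["id"]

    for stage in requests.get(f"{base}/applications/{app_id}/stages").json():
        print(f"stage {stage['stageId']}: "
              f"memoryBytesSpilled={stage['memoryBytesSpilled']}, "
              f"diskBytesSpilled={stage['diskBytesSpilled']}")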

