
SPARK: Pyspark: how to monitor python worker processes

Question
How to monitor pyspark python worker processes in terms of CPU and memory usage.

Details
According to this doc, one SPARK worker can contain 1 or more python processes.

Let's assume we have allocated 40g of memory per executor, running on a worker that has up to 200g of memory available. Then, according to the documented setting "spark.python.worker.memory", we are able to set the amount of memory available per python process.

Quoted from the spark.python.worker.memory setting description:

Amount of memory to use per python worker process during aggregation, in the same format as JVM memory strings (e.g. 512m, 2g). If the memory used during aggregation goes above this amount, it will spill the data into disks.

Let's assume that we set spark.python.worker.memory to 2g.
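As a minimal sketch of where such values could be set (the numbers are illustrative, not recommendations), the same configuration might be passed when building the session, or as --conf options to spark-submit:

    from pyspark.sql import SparkSession

    # Illustrative values only: 40g per executor, 2g per Python worker for aggregation.
    spark = (
        SparkSession.builder
        .appName("python-worker-memory-example")
        .config("spark.executor.memory", "40g")
        .config("spark.python.worker.memory", "2g")
        .getOrCreate()
    )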

To me, the following questions arise:

  • How do we know how many processes pyspark/spark is spawning on each worker/executor?
  • How can we monitor how much memory we consume per process, and overall, to see how close we are to the 'executor 40g' limit we set?
  • How can we monitor how much we're spilling to disk per process?
  • In more general terms, how can we optimize our pyspark applications using the spark.python.worker.memory setting? Is this just a question of trial and error? If so, how can we benchmark/monitor it (similar to the above)?



Why? Well, we are hitting some performance issues that are very specific to our application, and we are observing some inconsistent errors which we cannot reproduce. As such, we must monitor/understand the finer details of what is happening each time our application runs.

according to this documented setting: "spark.python.worker.memory" we are able to set the amount of memory available per python process.

This is not true. As explained in the documentation you've linked, this setting is used to control aggregation behavior, not Python worker memory in general.

This memory does not account for the size of local objects or broadcast variables, only for the temporary structures used for aggregations.

How do we know how many processes pyspark/spark is spawning on each worker/executor?

Python workers can be spawned up to the limit set by the number of available cores. Because workers can be started or killed at runtime, the actual number of workers outside peak load can be smaller.
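If you need an actual count on a given executor node, one rough sketch is to look for the worker processes from the node itself. This assumes the workers are identifiable by pyspark.daemon or pyspark.worker in their command line, and that psutil is installed on the node:

    import psutil

    def count_python_workers():
        # Count processes whose command line mentions the PySpark daemon/worker modules.
        count = 0
        for proc in psutil.process_iter(attrs=["pid", "cmdline"]):
            cmdline = " ".join(proc.info["cmdline"] or [])
            if "pyspark.daemon" in cmdline or "pyspark.worker" in cmdline:
                count += 1
        return count

    print(count_python_workers())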

How can we monitor how much memory we consume per process, and overall, to see how close we are to the 'executor 40g' limit we set?

There is no Spark-specific answer. You can use general monitoring tools, or the resource module from the application itself.
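For example, a sketch of the resource-module approach: log each Python worker's peak RSS from inside a mapPartitions task (rdd here stands for any existing RDD; ru_maxrss is reported in kilobytes on Linux):

    import os
    import resource

    def log_worker_memory(iterator):
        # Pass records through unchanged, then report this worker's peak RSS.
        for record in iterator:
            yield record
        peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        print(f"worker pid={os.getpid()} peak rss={peak_kb / 1024:.1f} MB")

    # Usage: rdd.mapPartitions(log_worker_memory).count()

The printed output ends up in the executors' stdout, so it has to be collected from the Spark UI or the worker log directories rather than the driver console.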

How can we monitor how much we're spilling to disk per process?

You can use the Spark REST API to get some insight, but in general PySpark metrics are somewhat limited.
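As a sketch, per-stage spill counters can be pulled from the REST API served by the driver UI (localhost:4040 here is an assumption; adjust host and port for your deployment). Note that these metrics are reported per stage, not per Python process:

    import requests

    base = "http://localhost:4040/api/v1"  # driver UI; adjust as needed
    app_id = requests.get(f"{base}/applications").json()[0]["id"]

    for stage in requests.get(f"{base}/applications/{app_id}/stages").json():
        print(stage["stageId"],
              stage.get("memoryBytesSpilled"),
              stage.get("diskBytesSpilled"))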
