
How to get the number of workers (executors) in PySpark?

I need to use this parameter, so how can I get the number of workers? In Scala, I can call sc.getExecutorMemoryStatus to get the available number of workers, but in PySpark there seems to be no API exposed to get this number.

In Scala, getExecutorStorageStatus and getExecutorMemoryStatus both return the number of executors including the driver, as in the example snippet below:

/** Method that just returns the current active/registered executors
  * excluding the driver.
  * @param sc The spark context to retrieve registered executors.
  * @return a list of executors each in the form of host:port.
  */
def currentActiveExecutors(sc: SparkContext): Seq[String] = {
  val allExecutors = sc.getExecutorMemoryStatus.map(_._1)
  val driverHost: String = sc.getConf.get("spark.driver.host")
  allExecutors.filter(! _.split(":")(0).equals(driverHost)).toList
}

But in the Python API this is not implemented.

@DanielDarabos' answer also confirms this.

The equivalent of this in Python:

sc.getConf().get("spark.executor.instances")
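
Note that spark.executor.instances is only present when a fixed number of executors was configured, so a defensive read with a fallback might look like the sketch below (the fallback value "1" is an assumption, not something the question specifies):

num_executors = sc.getConf().get("spark.executor.instances", "1")  # fallback "1" is an assumption
print(int(num_executors))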

Edit (Python):

%python
sc = spark._jsc.sc() 
n_workers =  len([executor.host() for executor in sc.statusTracker().getExecutorInfos() ]) -1

print(n_workers)
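
Subtracting 1 assumes exactly one of the reported entries belongs to the driver. A rough alternative sketch filters by spark.driver.host instead, mirroring the Scala helper above; it assumes a live SparkSession named spark and can undercount if an executor shares the driver's host:

jsc = spark.sparkContext._jsc.sc()
driver_host = spark.sparkContext.getConf().get("spark.driver.host")
# Keep only entries whose host differs from the driver's host
workers = [e.host() for e in jsc.statusTracker().getExecutorInfos()
           if e.host() != driver_host]
print(len(workers))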

As Danny mentioned in the comment, if you want to cross-verify them you can use the statements below.

%python
sc = spark._jsc.sc()
result1 = sc.getExecutorMemoryStatus().keys()  # will print all the executors + driver available
result2 = len([executor.host() for executor in sc.statusTracker().getExecutorInfos()]) - 1
print(result1, end='\n')
print(result2)

Example result:

Set(10.172.249.9:46467)
0

You can also get the number of executors via the Spark REST API: https://spark.apache.org/docs/latest/monitoring.html#rest-api

You can check /applications/[app-id]/executors, which returns a list of all active executors for the given application.


PS: When spark.dynamicAllocation.enabled is true, spark.executor.instances may not equal the number of currently available executors, but this API always returns the correct value.
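
For example, a minimal sketch that queries this endpoint from Python; the UI address http://localhost:4040 is an assumption and should be replaced with your driver's UI or history server address:

import json
import urllib.request

app_id = spark.sparkContext.applicationId
url = "http://localhost:4040/api/v1/applications/{}/executors".format(app_id)
with urllib.request.urlopen(url) as resp:
    executors = json.load(resp)

# The driver is listed alongside the executors with id "driver"
n_workers = len([e for e in executors if e["id"] != "driver"])
print(n_workers)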

I instantiated the SparkContext this way, but none of the solutions worked:

conf = SparkConf().setMaster(MASTER_CONNECTION_URL).setAppName('App name')
sc = SparkContext(conf=conf)

So I changed my code to instantiate the SparkContext with pyspark.sql.SparkSession and everything worked fine:

# Gets the Spark context
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf().setMaster(MASTER_CONNECTION_URL).setAppName('App name')
spark = SparkSession.builder.config(conf=conf).getOrCreate()
sc = spark.sparkContext

# Gets the number of workers
sc2 = sc._jsc.sc()
number_of_workers = len([executor.host() for executor in
                sc2.statusTracker().getExecutorInfos()]) - 1  # Subtract 1 to discard the driver
