
Spark load from Elasticsearch: number of executors and partitions

I'm trying to load data from an Elasticsearch index into a dataframe in Spark. My machine has 12 CPUs, each with 1 core. I'm using PySpark in a Jupyter Notebook with the following Spark configuration:

pathElkJar = currentUserFolder+"/elasticsearch-hadoop-"+connectorVersion+"/dist/elasticsearch-spark-20_2.11-"+connectorVersion+".jar"

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("elastic") \
    .config("spark.jars", pathElkJar) \
    .enableHiveSupport() \
    .getOrCreate()
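
For context, es_reader is a DataFrameReader configured for the elasticsearch-hadoop data source, along these lines (the node address and index name below are placeholders, not the actual values):

# Assumed reader definition: "localhost", "9200" and "my_index" are placeholders.
es_reader = spark.read \
    .format("org.elasticsearch.spark.sql") \
    .option("es.nodes", "localhost") \
    .option("es.port", "9200") \
    .option("es.resource", "my_index")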

Now, whether I do:

df = es_reader.load()

or:

df = es_reader.load(numPartitions=12)

I get the same output from the following prints:

print('Master: {}'.format(spark.sparkContext.master))
print('Number of partitions: {}'.format(df.rdd.getNumPartitions()))
print('Number of executors:{}'.format(spark.sparkContext._conf.get('spark.executor.instances')))
print('Partitioner: {}'.format(df.rdd.partitioner))
print('Partitions structure: {}'.format(df.rdd.glom().collect()))

Master: local[*]
Number of partitions: 1
Number of executors: None
Partitioner: None

I was expecting 12 partitions, which I can only obtain by doing a repartition() on the dataframe (sketched further below). Furthermore, I thought that the number of executors would by default equal the number of CPUs. But even by doing the following:

spark.conf.set("spark.executor.instances", "12")

I can't manually set the number of executors. It is true that I have 1 core for each of the 12 CPUs, but how should I go about it?
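
For reference, the repartition() mentioned above does produce the expected 12 partitions, but only by shuffling data that has already been read into a single partition:

df12 = df.repartition(12)           # forces a shuffle of the single input partition
print(df12.rdd.getNumPartitions())  # 12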

I modified the configuration used when creating the Spark session (setting it afterwards without restarting obviously leads to no changes), specifying the number of executors as follows:

spark = SparkSession.builder \
    .appName("elastic") \
    .config("spark.jars",pathElkJar) \
    .config("spark.executor.instances", "12") \
    .enableHiveSupport() \
    .getOrCreate()

I now correctly get 12 executors. Still, I don't understand why this doesn't happen automatically, and why the number of partitions when loading the dataframe is still 1. I would expect it to be 12, matching the number of executors; am I right?
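
As far as I understand the elasticsearch-hadoop documentation, the number of read partitions is derived from the index layout rather than from the executors: roughly one Spark partition per shard, optionally split further with the es.input.max.docs.per.partition setting. A sketch under that assumption, reusing the hypothetical my_index from above:

df_sliced = (spark.read
    .format("org.elasticsearch.spark.sql")
    .option("es.nodes", "localhost")                      # assumption: local ES node
    .option("es.resource", "my_index")                    # hypothetical index name
    .option("es.input.max.docs.per.partition", "100000")  # split large shards into smaller slices
    .load())
print(df_sliced.rdd.getNumPartitions())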

The problem regarding the executors and partitioning arose from the fact that I was using Spark in local mode, which allows at most one executor. Using YARN or another resource manager such as Mesos solved the problem.
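
A minimal sketch of the same session pointed at YARN instead of local mode (assuming a working Hadoop/YARN installation and the same pathElkJar as above):

from pyspark.sql import SparkSession

# "yarn" requires a configured Hadoop cluster; spark.executor.instances
# is then honoured by the resource manager instead of being ignored.
spark = SparkSession.builder \
    .appName("elastic") \
    .master("yarn") \
    .config("spark.jars", pathElkJar) \
    .config("spark.executor.instances", "12") \
    .enableHiveSupport() \
    .getOrCreate()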
