pySpark forEachPartition - Where is code executed

I'm using pySpark version 2.3 (I cannot update to 2.4 in my current dev system) and have the following questions concerning foreachPartition.

First a little context: as far as I understood, pySpark UDFs force the Python code to be executed outside the Java Virtual Machine (JVM) in a separate Python instance, which costs performance. Since I need to apply some Python functions to my data and want to minimize overhead, I had the idea of loading a manageable chunk of data into the driver and processing it as a Pandas DataFrame. However, this would lose the parallelism advantage Spark has. Then I read that foreachPartition applies a function to all the data within a partition and therefore allows parallel processing.

My questions now are:

  1. When I apply a Python function via foreachPartition, does the Python execution take place within the driver process (and is the partition data therefore transferred over the network to my driver)?

  2. Is the data processed row-wise within foreachPartition (meaning every RDD row is transferred one by one to the Python instance), or is the partition data processed at once (meaning, for example, the whole partition is transferred to the instance and is handled as a whole by one Python instance)?

Thank you in advance for your input!


Edit:

A working in-driver solution I used before looks like this, taken from SO here:

# Each partition is wrapped in a single-element list, so toLocalIterator()
# yields one whole partition at a time on the driver.
for partition in rdd.mapPartitions(lambda partition: [list(partition)]).toLocalIterator():
    pass  # Do stuff on the partition

As can be read from the docs, rdd.toLocalIterator() provides the necessary functionality:

Return an iterator that contains all of the elements in this RDD. The iterator will consume as much memory as the largest partition in this RDD.

Luckily I stumbled upon this great explanation of mapPartitions from Mrinal (answered here).

mapPartitions applies a function on each partition of an RDD. Hence, parallel processing is possible if the partitions are distributed over different nodes. The Python instances necessary for processing the Python functions are created on these nodes. While foreachPartition only applies a function for its side effects (e.g. writing your data to a .csv file), mapPartitions also returns a new RDD. Therefore, using foreachPartition was the wrong choice for me.
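To make the contrast concrete, here is a minimal sketch of both calls (assuming an existing SparkSession named spark; write_partition_to_csv, lowercase_partition and the /tmp output path are hypothetical helpers for illustration, not part of any API):

import csv
import os

def write_partition_to_csv(partition):
    # Hypothetical helper used with foreachPartition: write the rows of one
    # partition to a local file on the executor; nothing is returned to Spark.
    path = "/tmp/partition_{}.csv".format(os.getpid())  # placeholder path
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for row in partition:
            writer.writerow(row)

def lowercase_partition(partition):
    # Helper used with mapPartitions: yield new rows, from which Spark
    # builds a new RDD.
    for row in partition:
        yield [str(field).lower() for field in row]

rdd = spark.sparkContext.parallelize([("TESTA", 1), ("TESTB", 2)], 2)

rdd.foreachPartition(write_partition_to_csv)      # side effect only, returns None
new_rdd = rdd.mapPartitions(lowercase_partition)  # returns a new RDD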

To answer my second question: functions like map or UDFs create a new Python instance and pass data from the DataFrame/RDD row by row, resulting in a lot of overhead. foreachPartition and mapPartitions (both RDD functions) transfer an entire partition to a Python instance.

Additionally, using generators also reduces the amount of memory necessary for iterating over this transferred partition data (partitions are handled as iterator objects, and each row is then processed by iterating over this object).

An example might look like:

def generator(partition):
    """
    Yield a result for each row of a partition (here: lower-casing the
    strings in the row's "text" field).

    :param partition: iterator over the rows of one partition
    """
    for row in partition:
        yield [word.lower() for word in row["text"]]


df = spark.createDataFrame([(["TESTA"], ), (["TESTB"], )], ["text"])
df = df.repartition(2)  # two partitions, so the generator runs once per partition
df.rdd.mapPartitions(generator).toDF(["text"]).show()


#Result:
+-----+
| text|
+-----+
|testa|
|testb|
+-----+
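
For completeness, the same per-partition logic could be driven by foreachPartition when only a side effect is needed; a minimal sketch (process_partition is a hypothetical helper, and the print is just a stand-in for a real side effect such as writing to a file):

def process_partition(partition):
    # Hypothetical side-effect-only counterpart of the generator above:
    # iterate over the partition without returning a new RDD.
    for row in partition:
        print([word.lower() for word in row["text"]])  # stand-in for a real side effect

df.rdd.foreachPartition(process_partition)  # returns None; output appears in the executor logs (the console in local mode)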

Hope this helps somebody facing similar problems :)

pySpark UDFs execute near the executors - i.e. in a separate Python instance, per executor, that runs side by side and passes data back and forth between the Spark engine (Scala) and the Python interpreter.

The same is true for calls to UDFs inside a foreachPartition.
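
A small sketch of such a UDF call (assuming an existing SparkSession named spark; lower_udf and df_example are illustrative names only):

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# The lambda below runs in the Python worker process next to each executor,
# not in the driver; rows are shipped between the JVM and Python for it.
lower_udf = udf(lambda s: s.lower() if s is not None else None, StringType())

df_example = spark.createDataFrame([("TESTA",), ("TESTB",)], ["text"])
df_example.withColumn("text_lower", lower_udf("text")).show()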

Edit - after looking at the sample code

  1. Using RDDs is not an efficient way of using Spark - you should move to Datasets.
  2. What makes your code sync all data to the driver is the collect().
  3. foreachPartition will be similar to glom (see the sketch below).
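
To illustrate point 3: glom() collects the elements of each partition into a list, which is roughly what the mapPartitions(lambda partition: [list(partition)]) trick above does by hand. A minimal sketch (assuming an existing SparkSession named spark):

rdd = spark.sparkContext.parallelize([1, 2, 3, 4], 2)

# glom() turns each partition into a list of its elements ...
print(rdd.glom().collect())        # e.g. [[1, 2], [3, 4]]

# ... which is roughly what the manual mapPartitions version above produces.
print(rdd.mapPartitions(lambda partition: [list(partition)]).collect())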
