
Retrieve a local object from executor/worker in Spark

Is there a way to retrieve a local variable (or even a global one) from a worker/executor in Spark? Say I want to retrieve the list called ph_list and have the following code:

from typing import Iterator
import pandas as pd

df = spark.createDataFrame([(1, 21), (2, 30)], ("id", "age"))

def pandas_filter(iterator: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    ph_list = []
    i = 0
    for pdf in iterator:
        ph_list.append(i)
        i += 1
        yield pdf[pdf.id == 1]

df.mapInPandas(pandas_filter, schema=df.schema).show()

Once the code is executed, there is no object available by the name of ph_list. The only thing that is returned is the data frame the function is supposed to return, nothing else. However, sometimes (as in this case) we want to return things (like objects) that cannot be saved into a Spark data frame, hence the question.

Thanks

The DataFrame API is not meant to be used for custom objects. The advantage of DataFrames is having a defined schema with known types, which allows Spark to optimize computations internally.
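
For instance, the df from the question already carries a fully typed schema up front (a quick check, using the same df defined above):

df.printSchema()
# root
#  |-- id: long (nullable = true)
#  |-- age: long (nullable = true)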

If you need flexibility, you can use the RDD API. It gives you full control over the objects returned by your computations; however, Spark sees them as black boxes.

Here is a quick example:

import pickle

df.rdd.map(lambda row: pickle.dumps(row))
PythonRDD[30] at RDD at PythonRDD.scala:53

The code above serialises the rows with pickle and returns bytes objects. The returned type is PythonRDD, and if you do a collect, you will get a list of bytes, but it could be any type.

df.rdd.map(lambda row: pickle.dumps(row)).map(type).collect()
[<class 'bytes'>, <class 'bytes'>]
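
To illustrate that the elements can be any Python type, here is a minimal sketch (reusing the df from above) that returns plain dictionaries instead of pickled bytes; Spark simply carries them as opaque objects:

# map each Row to a plain Python dict; any picklable object would work here
custom_objects = df.rdd.map(lambda row: {"id": row.id, "age": row.age}).collect()
# custom_objects is an ordinary Python list on the driver:
# [{'id': 1, 'age': 21}, {'id': 2, 'age': 30}]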

EDIT: As discussed in the comments, you can work around this by adding a BinaryType column to your DataFrame and serialising the object there. Here is an example:

from typing import Iterator
import pandas as pd
import pickle

df = spark.createDataFrame([(1, 21), (2, 30)], ("id", "age"))

def pandas_filter(iterator: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    ph_list = []
    i = 0
    for pdf in iterator:
        ph_list.append(i)
        i += 1
        pdf_ph = pdf[pdf.id == 1]
        # attach the pickled list as a binary column on the surviving rows
        pdf_ph["ph_pkl"] = pickle.dumps(ph_list)
        yield pdf_ph

from pyspark.sql.types import StructType, StructField, LongType, BinaryType
new_schema = StructType([StructField("id", LongType(), True),
                         StructField("age", LongType(), True),
                         StructField("ph_pkl", BinaryType(), True)])

df.mapInPandas(pandas_filter, schema=new_schema).show(truncate=False)

Result:

+---+---+----------------------------+
|id |age|ph_pkl                      |
+---+---+----------------------------+
|1  |21 |[80 03 5D 71 00 4B 00 61 2E]|
+---+---+----------------------------+

The list can be "collected" and deserialised:

rows = df.mapInPandas(pandas_filter, schema=new_schema).take(1)
pickle.loads(rows[0].ph_pkl)

Result:

[0]
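
If more than one row per partition survives the filter, each of those rows carries the same pickled payload, so a minimal sketch for recovering one list per distinct payload (assuming the names from the example above) could look like this:

# collect the binary column, dedupe the payloads on the driver, then unpickle
rows = df.mapInPandas(pandas_filter, schema=new_schema).select("ph_pkl").collect()
unique_payloads = {bytes(r.ph_pkl) for r in rows}
ph_lists = [pickle.loads(p) for p in unique_payloads]
# for the data in this example, ph_lists ends up as [[0]]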
