
Retrieve a local object from executor/worker in Spark

Is there a way to retrieve a local (or even global) variable from a worker/executor in Spark? Say I want to retrieve the list called ph_list, and I have the following code:

from typing import Iterator
import pandas as pd

df = spark.createDataFrame([(1, 21), (2, 30)], ("id", "age"))

def pandas_filter(iterator: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    ph_list = []  # built on the executor; never shipped back to the driver
    i = 0
    for pdf in iterator:
        ph_list.append(i)  # record the batch index
        i += 1
        yield pdf[pdf.id == 1]

df.mapInPandas(pandas_filter, schema=df.schema).show()

Once the code is executed, there is no object named ph_list available on the driver. The only thing returned is the DataFrame the function is supposed to produce, nothing else. However, sometimes (as in this case) we want to get back objects that cannot be stored in a Spark DataFrame, hence the question.
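For example, referencing ph_list on the driver right after the job finishes fails, because the name only ever existed inside the function running on the executors:

ph_list
NameError: name 'ph_list' is not defined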

Thanks

The DataFrame API is not meant to be used for custom objects. The advantage of DataFrames is having a defined schema with known types, which allows Spark to optimize computations internally.

If you need flexibility, you can use the RDD API. It gives you full control over the objects returned by your computations; however, Spark sees them as black boxes.

Here is a quick example:

import pickle

df.rdd.map(lambda row: pickle.dumps(row))
PythonRDD[30] at RDD at PythonRDD.scala:53

The code above serialises each row with pickle and returns a bytes object per row. The returned type is a PythonRDD, and if you collect it, you get a list of bytes objects, but it could be any type.

df.rdd.map(lambda row: pickle.dumps(row)).map(type).collect()
[<class 'bytes'>, <class 'bytes'>]
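The round trip works the same way: mapping pickle.loads over the serialised RDD restores the original Row objects on collect. A minimal sketch:

df.rdd.map(lambda row: pickle.dumps(row)).map(pickle.loads).collect()
[Row(id=1, age=21), Row(id=2, age=30)]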

EDIT: As discussed in the comments, you can work around this by adding a BinaryType column to your DataFrame and serialising the object there. Here is an example:

from typing import Iterator
import pandas as pd
import pickle

df = spark.createDataFrame([(1, 21), (2, 30)], ("id", "age"))

def pandas_filter(iterator: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    ph_list = []
    i = 0
    for pdf in iterator:
        ph_list.append(i)
        i += 1
        pdf_ph = pdf[pdf.id == 1].copy()  # copy to avoid SettingWithCopyWarning
        pdf_ph["ph_pkl"] = pickle.dumps(ph_list)  # ship the pickled list as a binary column
        yield pdf_ph

from pyspark.sql.types import StructType, StructField, LongType, BinaryType
new_schema = StructType([StructField("id", LongType(), True),
                         StructField("age", LongType(), True),
                         StructField("ph_pkl", BinaryType(), True)])

df.mapInPandas(pandas_filter, schema=new_schema).show(truncate=False)

Result:

+---+---+----------------------------+
|id |age|ph_pkl                      |
+---+---+----------------------------+
|1  |21 |[80 03 5D 71 00 4B 00 61 2E]|
+---+---+----------------------------+

The list can be "collected" and deserialised:

rows = df.mapInPandas(pandas_filter, schema=new_schema).take(1)
pickle.loads(rows[0].ph_pkl)

Result:

[0]
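If several partitions each build their own ph_list, the same trick scales: every yielded row carries its partition's pickled list, so you can collect the ph_pkl column and deserialise each value. A sketch (note that every surviving row repeats its partition's list, so deduplicate if needed):

rows = df.mapInPandas(pandas_filter, schema=new_schema).select("ph_pkl").collect()
lists = [pickle.loads(r.ph_pkl) for r in rows]  # one (possibly duplicated) list per row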
