If I call map or mapPartitions and my function receives rows from PySpark, what is the natural way to create either a local PySpark or Pandas DataFrame? Something that combines the rows and retains the schema?
Currently I do something like:
def combine(partition):
    rows = [x for x in partition]
    dfpart = pd.DataFrame(rows, columns=rows[0].keys())
    pandafunc(dfpart)

mydf.mapPartitions(combine)
Spark >= 2.3.0
Since Spark 2.3.0 it is possible to use a Pandas Series or DataFrame per partition or per group (for example with a grouped-map pandas_udf).
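A minimal sketch of the per-group pandas logic, assuming a toy `subtract_mean` function and columns `id`/`v` (both made up for illustration). The Spark 2.3.0 wiring is shown only in comments so the sketch runs without a Spark session:

```python
import pandas as pd

# Per-group logic written against a plain pandas DataFrame.
# With Spark >= 2.3.0 the same function can be registered as a
# grouped-map pandas UDF (wiring shown as comments):
#
#   from pyspark.sql.functions import pandas_udf, PandasUDFType
#
#   @pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
#   def subtract_mean(pdf): ...
#
#   df.groupBy("id").apply(subtract_mean)
def subtract_mean(pdf):
    # center the "v" column within the group
    return pdf.assign(v=pdf["v"] - pdf["v"].mean())

# Local stand-in for one group that Spark would hand to the UDF
group = pd.DataFrame({"id": [1, 1, 1], "v": [1.0, 2.0, 3.0]})
centered = subtract_mean(group)
# centered["v"] is now [-1.0, 0.0, 1.0]
```

Each group arrives as a complete pandas DataFrame, so any pandas function can be applied to it, and Spark reassembles the returned frames into a distributed DataFrame.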
Spark < 2.3.0
what is the natural way to create either a local PySpark
There is no such thing. Spark distributed data structures cannot be nested, or, if you prefer another perspective, you cannot nest actions or transformations.
or Pandas DataFrame
It is relatively easy, but you have to remember at least a few things:

- Partitions can be empty, so your function should handle an empty iterator.
- Plain Python dictionaries don't preserve key order (you can use collections.OrderedDict, for example), so passing columns may not work as expected.

import pandas as pd
rdd = sc.parallelize([
    {"x": 1, "y": -1},
    {"x": -3, "y": 0},
    {"x": -0, "y": 4}
])

def combine(iter):
    rows = list(iter)
    return [pd.DataFrame(rows)] if rows else []

rdd.mapPartitions(combine).first()
##    x  y
## 0  1 -1
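To illustrate the ordering caveat above, here is a small pandas-only sketch (the row values are made up) showing how collections.OrderedDict pins the column order when a partition yields dictionaries:

```python
from collections import OrderedDict

import pandas as pd

# Hypothetical partition contents: with plain dicts on older Pythons
# the key order was not guaranteed, so either let pandas infer the
# columns or build rows from OrderedDict to pin the column order.
rows = [
    OrderedDict([("x", 1), ("y", -1)]),
    OrderedDict([("x", -3), ("y", 0)]),
]
dfpart = pd.DataFrame(rows)
# dfpart.columns is ["x", "y"], matching the insertion order
```

Passing an explicit columns= list built from the first row's keys (as in the question) has the same problem: with unordered dicts the keys may come back in any order.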
You can use toPandas():
pandasdf = mydf.toPandas()
In order to create a Spark SQL DataFrame you need a HiveContext:
from pyspark.sql import HiveContext

hc = HiveContext(sparkContext)
With the HiveContext you can create a SQL DataFrame via the inferSchema function:
sparkSQLdataframe = hc.inferSchema(rows)
It's actually possible to convert Spark rows to Pandas inside executors and finally create a Spark DataFrame from that output using mapPartitions. See my gist on GitHub.
# Convert function to use in mapPartitions
def rdd_to_pandas(rdd_):
    # convert rows to dicts
    rows = (row_.asDict() for row_ in rdd_)
    # create pandas dataframe
    pdf = pd.DataFrame(rows)

    # Rows/Pandas DF can be empty depending on partitioning logic.
    # Make sure to check it here, otherwise it will throw a hard-to-trace error.
    if len(pdf) > 0:
        #
        # Do something with pandas DataFrame
        #
        pass

    return pdf.to_dict(orient='records')

# Create Spark DataFrame from resulting RDD
rdf = spark.createDataFrame(df.rdd.mapPartitions(rdd_to_pandas))
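You can sanity-check this function locally without a Spark session by emulating pyspark.sql.Row with a small stand-in that only exposes asDict() (FakeRow is a hypothetical helper for the sketch, not part of PySpark):

```python
import pandas as pd

class FakeRow:
    # Minimal stand-in for pyspark.sql.Row, exposing only asDict()
    def __init__(self, **kwargs):
        self._d = kwargs

    def asDict(self):
        return dict(self._d)

def rdd_to_pandas(rdd_):
    # same logic as above: dicts -> pandas DataFrame -> records
    rows = (row_.asDict() for row_ in rdd_)
    pdf = pd.DataFrame(rows)
    if len(pdf) > 0:
        pass  # place per-partition pandas work here
    return pdf.to_dict(orient='records')

# One "partition" of fake rows, plus the empty-partition edge case
records = rdd_to_pandas(iter([FakeRow(x=1, y=-1), FakeRow(x=-3, y=0)]))
empty = rdd_to_pandas(iter([]))
```

The round trip through to_dict(orient='records') is what lets createDataFrame rebuild a Spark DataFrame from the mapped partitions, and the empty-partition case simply yields no records.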