How to convert Pandas Dataframe coming from RDD.mapPartitions() into Spark DataFrame?
I have a Python function that returns a Pandas DataFrame. I call this function in Spark 2.2.0 using pyspark's RDD.mapPartitions(), but I cannot convert the RDD returned by mapPartitions() into a Spark DataFrame. Pandas raises this error:
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Simple code that illustrates the problem:
import pandas as pd

def func(data):
    pdf = pd.DataFrame(list(data), columns=("A", "B", "C"))
    pdf += 10  # Add 10 to every value. The real function is a lot more complex!
    return [pdf]

pdf = pd.DataFrame([(1.87, 0.6, 7.1), (-0.3, 0.1, 8.2), (2.8, 0.3, 6.1), (-0.2, 0.5, 5.9)], columns=("A", "B", "C"))
sdf = spark.createDataFrame(pdf)
sdf.show()

rddIn = sdf.rdd
for i in rddIn.collect():
    print(i)

result = rddIn.mapPartitions(func)
for i in result.collect():
    print(i)

resDf = spark.createDataFrame(result)  # --> ValueError!
resDf.show()
The output is:
+----+---+---+
|   A|  B|  C|
+----+---+---+
|1.87|0.6|7.1|
|-0.3|0.1|8.2|
| 2.8|0.3|6.1|
|-0.2|0.5|5.9|
+----+---+---+
Row(A=1.87, B=0.6, C=7.1)
Row(A=-0.3, B=0.1, C=8.2)
Row(A=2.8, B=0.3, C=6.1)
Row(A=-0.2, B=0.5, C=5.9)
       A     B     C
0  11.87  10.6  17.1
     A     B     C
0  9.7  10.1  18.2
      A     B     C
0  12.8  10.3  16.1
     A     B     C
0  9.8  10.5  15.9
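But the second-to-last line raises the ValueError shown above. I would really like resDf.show() to look exactly like sdf.show(), except with 10 added to every value in the table. Ideally, the result RDD should have the same structure as rddIn, the RDD that goes into mapPartitions().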
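The error message itself can be reproduced without Spark, because pandas raises it whenever a DataFrame is evaluated in a boolean context. The sketch below shows only that pandas behaviour. That Spark's schema inference performs such a truthiness check on the first element of the RDD is an assumption, not something stated in this post. The helper name `truthiness_error` is hypothetical.

```python
import pandas as pd

# A one-row frame, similar to what each partition returns in the question.
pdf = pd.DataFrame([(11.5, 10.5, 17.0)], columns=("A", "B", "C"))


def truthiness_error(obj):
    """Return the ValueError message raised by bool(obj), or None if none is raised."""
    try:
        bool(obj)
    except ValueError as exc:
        return str(exc)
    return None


# Evaluating the DataFrame itself as a boolean raises the ValueError.
message = truthiness_error(pdf)
print(message)

# A list that merely contains a DataFrame is an ordinary truthy list.
print(truthiness_error([pdf]))
```

This suggests that the elements of the RDD need to be plain row-like values rather than whole DataFrames before they are handed to createDataFrame().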
You have to convert the data to standard Python types and flatten it:
resDf = spark.createDataFrame(
    result.flatMap(lambda df: (r.tolist() for r in df.to_records()))
)
resDf.show()
# +---+------------------+----+----+
# | _1|                _2|  _3|  _4|
# +---+------------------+----+----+
# |  0|11.870000000000001|10.6|17.1|
# |  0|               9.7|10.1|18.2|
# |  0|              12.8|10.3|16.1|
# |  0|               9.8|10.5|15.9|
# +---+------------------+----+----+
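The output above carries an extra `_1` column (the pandas index) and generic column names, while the question asks for the same structure as `sdf`. A minimal sketch of one way to get closer to that, assuming it is acceptable to drop the index with `to_records(index=False)` and to pass the original schema explicitly. The helper names `to_rows` and `to_spark` are hypothetical, and the Spark call is wrapped in a function so the conversion step can be checked without a cluster.

```python
import pandas as pd


def to_rows(pdf):
    """Convert a pandas DataFrame into a list of tuples of plain Python values."""
    # index=False leaves the pandas index out of the records, so no extra
    # leading column appears. tolist() turns NumPy scalars into Python scalars.
    return [rec.tolist() for rec in pdf.to_records(index=False)]


def to_spark(spark, result, schema):
    """Flatten an RDD of pandas DataFrames and build a Spark DataFrame from it.

    `spark`, `result` and `schema` stand for the SparkSession, the RDD returned
    by mapPartitions() and `sdf.schema` from the question.
    """
    return spark.createDataFrame(result.flatMap(to_rows), schema=schema)


# Local check of the conversion step only, without Spark.
sample = pd.DataFrame([(11.5, 10.5, 17.0), (9.5, 10.25, 18.0)], columns=("A", "B", "C"))
rows = to_rows(sample)
print(rows)
```

With the objects from the question this would be called as `to_spark(spark, result, sdf.schema).show()`.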
If you are using Spark 2.3, this should work as well:
from pyspark.sql.functions import pandas_udf, spark_partition_id
from pyspark.sql.functions import PandasUDFType

@pandas_udf(sdf.schema, functionType=PandasUDFType.GROUPED_MAP)
def func(pdf):
    pdf += 10
    return pdf

sdf.groupBy(spark_partition_id().alias("_pid")).apply(func)
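On newer releases the same per-partition idea can be written with `DataFrame.mapInPandas`, which is not part of the original answer. A minimal sketch, assuming Spark 3.0 or later: the function receives an iterator of pandas DataFrames and yields pandas DataFrames, so no grouping by partition ID is needed. The names `add_ten` and `apply_with_spark` are hypothetical.

```python
import pandas as pd


def add_ten(batches):
    """Take an iterator of pandas DataFrames and yield each one with 10 added."""
    for pdf in batches:
        yield pdf + 10


def apply_with_spark(sdf):
    """Apply add_ten to a Spark DataFrame, batch by batch.

    Assumes Spark 3.0+, where DataFrame.mapInPandas is available; `sdf` stands
    for the Spark DataFrame from the question.
    """
    return sdf.mapInPandas(add_ten, schema=sdf.schema)


# Local check of the pandas logic only, without Spark.
batch = pd.DataFrame([(1.5, 0.5, 7.0)], columns=("A", "B", "C"))
local_result = list(add_ten(iter([batch])))
print(local_result[0])
```

Because the output schema is passed explicitly, the resulting Spark DataFrame keeps the column names A, B and C.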