pyspark create dictionary data from pyspark sql dataframe
I have a pyspark.sql.dataframe.DataFrame with the following structure; it continues like this for every month and for every country:
+----------+-------+------------------+
|DATE |COUNTRY|AVG_TEMPS |
+----------+-------+------------------+
|2007-01-01|Åland |0.5939999999999999|
|2007-02-01|Åland |-4.042 |
|2007-03-01|Åland |2.443 |
|2007-04-01|Åland |4.621 |
|2007-05-01|Åland |8.411 |
|2007-06-01|Åland |13.722999999999999|
|2007-07-01|Åland |15.749 |
+----------+-------+------------------+
The expected output is a Python dictionary per date, as in the linked question below:
pyspark - create DataFrame Grouping columns in map type structure
-----------------------------------------
| DATE | COUNTRY_TEMP |
-----------------------------------------
|2007-01-01|{Åland: 0.593, Alfredo:2.44}|
|2007-01-02| {Åland: 0.57, Alfredo:2.14}|
-----------------------------------------
When I tried to follow that approach, I got an error:
df_converted = newres.groupBy('DATE').\
agg(collect_list(create_map(col("COUNTRY"))))
Error:
AnalysisException: u"cannot resolve 'map(`COUNTRY`)' due to data type mismatch: map expects a positive even number of arguments.
;;\n'Aggregate [DATE#179], [DATE#179, collect_list(map(COUNTRY#180), 0, 0) AS collect_list(map(COUNTRY))#189]\n+- Project [DATE#146 AS DATE#179,
COUNTRY#85 AS COUNTRY#180, AVG_TEMPS#147 AS AVG_TEMPS#181]\n +- Project [dt#82 AS DATE#146, COUNTRY#85, AverageTemperature#83 AS AVG_TEMPS#147]
\n +- SubqueryAlias global_temps_by_cntry\n +- Relation[dt#82,AverageTemperature#83,AverageTemperatureUncertainty#84,Country#85] csv\n"
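Can anyone help?
As @user3689574 mentioned, try adding the values to create_map. create_map takes alternating key and value columns, so passing only COUNTRY (a key with no value) triggers the "positive even number of arguments" error: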
df = spark.createDataFrame([('2007-01-01', 'Aland', 0.593), ('2007-01-01', 'Alfredo', 2.44),('2007-01-02', 'Aland', 2.57), ('2007-01-02', 'Alfredo', 2.14)], ['DATE', 'COUNTRY', 'AVG_TEMPS'])
df.show()
+----------+-------+---------+
| DATE |COUNTRY|AVG_TEMPS|
+----------+-------+---------+
|2007-01-01| Aland| 0.593|
|2007-01-01|Alfredo| 2.44|
|2007-01-02| Aland| 2.57|
|2007-01-02|Alfredo| 2.14|
+----------+-------+---------+
from pyspark.sql.functions import collect_list, col, create_map
# Build one {COUNTRY: AVG_TEMPS} map per row, then collect the maps for each DATE
df2 = df.groupBy("DATE").agg(collect_list(create_map(col("COUNTRY"), col("AVG_TEMPS"))).alias("COUNTRY_TEMP"))
df2.show(4, False)
+----------+-------------------------------------+
|DATE |COUNTRY_TEMP |
+----------+-------------------------------------+
|2007-01-01|[[Aland -> 0.593], [Alfredo -> 2.44]]|
|2007-01-02|[[Aland -> 2.57], [Alfredo -> 2.14]] |
+----------+-------------------------------------+
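The result above is a list of single-entry maps per date, not one dictionary per date as the question asked for. A minimal sketch of merging them on the driver in plain Python follows. It assumes the rows have already been collected, for example with `[(r["DATE"], r["COUNTRY_TEMP"]) for r in df2.collect()]`, where each map column arrives as a Python dict. The helper name `merge_country_maps` and the `sample_rows` data are hypothetical, used only for illustration, and this only suits results small enough to fit in driver memory.

```python
from typing import Dict, Iterable, List, Tuple


def merge_country_maps(
    rows: Iterable[Tuple[str, List[Dict[str, float]]]]
) -> Dict[str, Dict[str, float]]:
    """Merge each date's list of single-entry maps into one dict per date."""
    result: Dict[str, Dict[str, float]] = {}
    for date, maps in rows:
        merged = result.setdefault(date, {})
        for single_map in maps:
            # A later entry for the same country overwrites an earlier one.
            merged.update(single_map)
    return result


# Hypothetical stand-in for the collected rows of df2 (no Spark session needed).
sample_rows = [
    ("2007-01-01", [{"Aland": 0.593}, {"Alfredo": 2.44}]),
    ("2007-01-02", [{"Aland": 2.57}, {"Alfredo": 2.14}]),
]

country_temp_by_date = merge_country_maps(sample_rows)
print(country_temp_by_date)
```

Merging on the driver keeps the Spark side unchanged. If the merge has to stay inside Spark, a map-building function applied to the collected entries would be the place to look, depending on the Spark version in use.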