Create multidict from pyspark dataframe
I am new to pyspark and want to create a dictionary from a pyspark dataframe. I have working pandas code, but I need an equivalent command in pyspark and somehow I am not able to figure out how to do it.
df = spark.createDataFrame([
(11, 101, 5.9),
(11, 102, 5.4),
(22, 111, 5.2),
(22, 112, 5.9),
(22, 101, 5.7),
(33, 101, 5.2),
(44, 102, 5.3),
], ['user_id', 'team_id', 'height'])
df = df.select(['user_id', 'team_id'])
df.show()
+-------+-------+
|user_id|team_id|
+-------+-------+
| 11| 101|
| 11| 102|
| 22| 111|
| 22| 112|
| 22| 101|
| 33| 101|
| 44| 102|
+-------+-------+
df.toPandas().groupby('user_id')['team_id'].apply(list).to_dict()
Result:
{11: [101, 102], 22: [111, 112, 101], 33: [101], 44: [102]}
Looking for an efficient way in pyspark to create the above multidict.
You can aggregate the team_id column as a list and then collect the RDD as a dictionary using the collectAsMap method:
import pyspark.sql.functions as F
df.groupBy("user_id").agg(F.collect_list("team_id")).rdd.collectAsMap()
# {33: [101], 11: [101, 102], 44: [102], 22: [111, 112, 101]}
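For comparison, the same grouping logic can be sketched in plain Python, without a Spark session. This is only an illustration of the dictionary shape that collectAsMap produces, not the pyspark API itself:

```python
from collections import defaultdict

# Sample (user_id, team_id) pairs matching the dataframe above
rows = [
    (11, 101), (11, 102),
    (22, 111), (22, 112), (22, 101),
    (33, 101), (44, 102),
]

# Emulate groupBy("user_id").agg(F.collect_list("team_id")) on the driver
multidict = defaultdict(list)
for user_id, team_id in rows:
    multidict[user_id].append(team_id)

result = dict(multidict)
# {11: [101, 102], 22: [111, 112, 101], 33: [101], 44: [102]}
```

Note that collectAsMap (like the sketch above) materializes the whole result in driver memory, so it is only appropriate when the number of distinct user_id values is small.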