How to create a dictionary from two DataFrame columns in PySpark?

Question

I have a dataframe with two columns that looks as follows:

    df = spark.createDataFrame([('A', 'Science'),
                                ('A', 'Math'),
                                ('A', 'Physics'),
                                ('B', 'Science'),
                                ('B', 'English'),
                                ('C', 'Math'),
                                ('C', 'English'),
                                ('C', 'Latin')],
                               ['Group', 'Subjects'])


Group   Subjects
A       Science
A       Math
A       Physics
B       Science
B       English
C       Math
C       English
C       Latin

I need to iterate through this data for each unique value in the Group column and perform some processing. I'm thinking of creating a dictionary with each Group name as the key and its corresponding list of Subjects as the value.

So my expected output would look like this:

{'A': ['Science', 'Math', 'Physics'], 'B': ['Science', 'English'], 'C': ['Math', 'English', 'Latin']}

How can I achieve this in PySpark?

Answer 1

You can use groupBy with collect_list to gather each group's subjects into a list, then collect the rows into a dictionary.

    from pyspark.sql import functions as F

    # Input DF
    # +-----+-------+
    # |group|subject|
    # +-----+-------+
    # |    A|   Math|
    # |    A|Physics|
    # |    B|Science|
    # +-----+-------+

    df1 = df.groupBy("group").agg(F.collect_list("subject").alias("subject")).orderBy("group")

    df1.show(truncate=False)

    # +-----+---------------+
    # |group|subject        |
    # +-----+---------------+
    # |A    |[Math, Physics]|
    # |B    |[Science]      |
    # +-----+---------------+

    # Build the dictionary on the driver; avoid naming it `dict`, which shadows the built-in
    result = {row['group']: row['subject'] for row in df1.collect()}

    print(result)

    # {'A': ['Math', 'Physics'], 'B': ['Science']}
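
The same dictionary can also be assembled on the driver from ungrouped rows. A minimal sketch, assuming each row supports lookup by column name (as pyspark `Row` objects do); `rows_to_dict` is a hypothetical helper, and the plain dicts below only stand in for the output of `df.collect()`:

```python
from collections import defaultdict


def rows_to_dict(rows, key_col, value_col):
    """Group value_col by key_col, preserving the order rows arrive in."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[key_col]].append(row[value_col])
    return dict(grouped)


# Stand-in for df.collect(); real Row objects support the same row["name"] lookup.
rows = [
    {"group": "A", "subject": "Math"},
    {"group": "A", "subject": "Physics"},
    {"group": "B", "subject": "Science"},
]

result = rows_to_dict(rows, "group", "subject")
print(result)  # {'A': ['Math', 'Physics'], 'B': ['Science']}
```

Either way, all the data ends up on the driver, so both approaches suit only results that fit in driver memory.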

Answer 2

You can use collect_set if you need unique subjects; otherwise use collect_list.

    import pyspark.sql.functions as F

    df = spark.createDataFrame([('A', 'Science'),
                                ('A', 'Math'),
                                ('A', 'Physics'),
                                ('B', 'Science'),
                                ('B', 'English'),
                                ('C', 'Math'),
                                ('C', 'English'),
                                ('C', 'Latin')],
                               ['Group', 'Subjects'])

    # Collect the unique subjects per group, then pair each group with its list in a map column
    df_tst = (df.groupby('Group')
                .agg(F.collect_set('Subjects').alias('Subjects'))
                .withColumn('dict', F.create_map('Group', 'Subjects')))

    df_tst.show(truncate=False)

Results:

+-----+------------------------+-------------------------------+
|Group|Subjects                |dict                           |
+-----+------------------------+-------------------------------+
|C    |[Math, Latin, English]  |[C -> [Math, Latin, English]]  |
|B    |[Science, English]      |[B -> [Science, English]]      |
|A    |[Math, Physics, Science]|[A -> [Math, Physics, Science]]|
+-----+------------------------+-------------------------------+
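
Note that the `dict` column holds one single-entry map per row rather than one Python dictionary. Collected map values arrive in Python as `dict` objects, so they can be merged on the driver. A minimal sketch, where `merge_maps` is a hypothetical helper and the literal list stands in for `[row['dict'] for row in df_tst.collect()]`:

```python
def merge_maps(maps):
    """Merge an iterable of dicts into one; later keys overwrite earlier ones."""
    merged = {}
    for m in maps:
        merged.update(m)
    return merged


# Stand-in for [row["dict"] for row in df_tst.collect()].
per_row = [
    {"C": ["Math", "Latin", "English"]},
    {"B": ["Science", "English"]},
    {"A": ["Math", "Physics", "Science"]},
]

combined = merge_maps(per_row)
print(sorted(combined))  # ['A', 'B', 'C']
```

Because collect_set does not guarantee element order, compare the subject lists as sets when checking results.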

Answer 1: score 1, accepted, 2020-07-01 13:16:47

Answer 2: score 0, 2020-07-01 13:37:12
