
How to convert PySpark dataframe columns into list of dictionary based on groupBy column

I'm converting dataframe columns into a list of dictionaries.

The input dataframe has 3 columns:

ID  accounts pdct_code
1    100      IN
1    200      CC
2    300      DD
2    400      ZZ
3    500      AA

I need to read this input dataframe and convert it into 3 output rows. The output should look like this:

ID arrayDict
1  [{"accounts": 100, "pdct_cd": "IN"}, {"accounts": 200, "pdct_cd": "CC"}]

Similarly, for ID "2" there should be 1 row containing 2 dictionaries of key-value pairs.

I tried this:

from pyspark.sql.functions import col, collect_list, struct
Df1 = df.groupBy("ID").agg(collect_list(struct(col("accounts"), col("pdct_code"))).alias("array_dict"))

But the output is not quite what I wanted, which is a list of dictionaries.

What you described (a list of dictionaries) doesn't exist in Spark. Instead of lists we have arrays, and instead of dictionaries we have structs or maps. Since you didn't use these terms, this will be a loose interpretation of what I think you need.

The following will create arrays of strings. Those strings will have the structure you probably want.

df.groupBy("ID").agg(F.collect_list(F.to_json(F.struct("accounts", "pdct_code"))))

struct() puts your columns inside a struct data type.
to_json() creates a JSON string out of the provided struct.
collect_list() is an aggregation function which gathers all the strings of a group into an array.

Full example:

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [(1, 100, "IN"),
     (1, 200, "CC"),
     (2, 300, "DD"),
     (2, 400, "ZZ"),
     (3, 500, "AA")],
    ["ID", "accounts", "pdct_code"])

df = df.groupBy("ID").agg(F.collect_list(F.to_json(F.struct("accounts", "pdct_code"))).alias("array_dict"))

df.show(truncate=0)
# +---+----------------------------------------------------------------------+
# |ID |array_dict                                                            |
# +---+----------------------------------------------------------------------+
# |1  |[{"accounts":100,"pdct_code":"IN"}, {"accounts":200,"pdct_code":"CC"}]|
# |3  |[{"accounts":500,"pdct_code":"AA"}]                                   |
# |2  |[{"accounts":300,"pdct_code":"DD"}, {"accounts":400,"pdct_code":"ZZ"}]|
# +---+----------------------------------------------------------------------+
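
If actual Python dictionaries are needed on the driver, the JSON strings in array_dict can be parsed with the standard json module after collecting. A minimal sketch: the `rows` list is a hand-written stand-in for what `df.collect()` would return from the example above, and `parse_array_dict` is a hypothetical helper name.

```python
import json

# Stand-in for df.collect(): (ID, array_dict) pairs,
# where array_dict is a list of JSON strings.
rows = [
    (1, ['{"accounts":100,"pdct_code":"IN"}', '{"accounts":200,"pdct_code":"CC"}']),
    (3, ['{"accounts":500,"pdct_code":"AA"}']),
    (2, ['{"accounts":300,"pdct_code":"DD"}', '{"accounts":400,"pdct_code":"ZZ"}']),
]

def parse_array_dict(rows):
    """Map each ID to a list of dicts by parsing every JSON string."""
    return {id_: [json.loads(s) for s in json_strings] for id_, json_strings in rows}

result = parse_array_dict(rows)
print(result[1])
```

This only makes sense for results small enough to collect; on large data the parsing would need to stay inside Spark.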
