
Convert Python dictionary into PySpark DataFrame

I have a JSON file which contains a dictionary in the following format:

{"a1":{"b1":["c1","c2"], "b2":["c4","c3"]}, "a2":{"b3":["c1","c4"]}}

Is it possible to convert this dictionary into a PySpark DataFrame like the following?

+------+------+------+
| col1 | col2 | col3 |
+------+------+------+
| a1   | b1   | c1   |
| a1   | b1   | c2   |
| a1   | b2   | c4   |
| a1   | b2   | c3   |
| a2   | b3   | c1   |
| a2   | b3   | c4   |
+------+------+------+

I have seen the standard way of converting JSON to a PySpark DataFrame (example in this link), but was wondering about nested dictionaries that contain lists as well.
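For reference, the target row set can be sketched in plain Python (no Spark) with nested loops over the dictionary; this only illustrates the flattening the question asks for, not a Spark solution:

```python
import json

raw = '{"a1":{"b1":["c1","c2"], "b2":["c4","c3"]}, "a2":{"b3":["c1","c4"]}}'
data = json.loads(raw)

# Flatten the nested dict into (col1, col2, col3) tuples:
# outer key, inner key, then one row per list element.
rows = [(k1, k2, v)
        for k1, inner in data.items()
        for k2, values in inner.items()
        for v in values]
print(rows)
# [('a1', 'b1', 'c1'), ('a1', 'b1', 'c2'), ('a1', 'b2', 'c4'),
#  ('a1', 'b2', 'c3'), ('a2', 'b3', 'c1'), ('a2', 'b3', 'c4')]
```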

Interesting problem. The main difficulty I ran into is that when you read this from JSON, the inferred schema is likely a struct type, which makes the problem harder to solve: `a1` effectively has a different type than `a2`.
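You can see the shape mismatch directly in plain Python: the inner dicts under `a1` and `a2` have different key sets, so Spark's schema inference produces two distinct struct types rather than one uniform map type:

```python
import json

raw = '{"a1":{"b1":["c1","c2"], "b2":["c4","c3"]}, "a2":{"b3":["c1","c4"]}}'
data = json.loads(raw)

# Different field sets => no single struct type describes both values.
print(set(data["a1"]))  # {'b1', 'b2'}
print(set(data["a2"]))  # {'b3'}
```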

My idea is to somehow convert the struct type to a map type, stack the columns together, then apply a few `explode`s:

This is your df:
+----------------------------------+
|data                              |
+----------------------------------+
|{{[c1, c2], [c4, c3]}, {[c1, c4]}}|
+----------------------------------+

root
 |-- data: struct (nullable = true)
 |    |-- a1: struct (nullable = true)
 |    |    |-- b1: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- b2: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |-- a2: struct (nullable = true)
 |    |    |-- b3: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
Create a temporary df to handle the JSON's first level:
first_level_df = df.select('data.*')
first_level_df.show()
first_level_cols = first_level_df.columns # ['a1', 'a2']

+--------------------+----------+
|                  a1|        a2|
+--------------------+----------+
|{[c1, c2], [c4, c3]}|{[c1, c4]}|
+--------------------+----------+
Some helper variables (with the imports the snippets below rely on):
from pyspark.sql import functions as F
from pyspark.sql import types as T

map_cols = [F.from_json(F.to_json(c), T.MapType(T.StringType(), T.StringType())).alias(c) for c in first_level_cols]
# [Column<'entries AS a1'>, Column<'entries AS a2'>]

stack_cols = ', '.join([f"'{c}', {c}" for c in first_level_cols])
# 'a1', a1, 'a2', a2
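The `stack_cols` string is plain Python string building, so you can check it without Spark. It becomes the argument list of Spark SQL's `stack()` function, where each column name appears once quoted (the label) and once bare (the value), turning the two columns into two rows of (label, value) pairs:

```python
first_level_cols = ['a1', 'a2']

# Build the argument list for Spark SQL's stack() expression.
stack_cols = ', '.join([f"'{c}', {c}" for c in first_level_cols])
print(stack_cols)  # 'a1', a1, 'a2', a2
```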
Main transformation:
(first_level_df
    .select(map_cols)
    .select(F.expr(f'stack(2, {stack_cols})').alias('AA', 'temp'))
    .select('AA', F.explode('temp').alias('BB', 'temp'))
    .select('AA', 'BB', F.explode(F.from_json('temp', T.ArrayType(T.StringType()))).alias('CC'))
    .show(10, False)
)

+---+---+---+
|AA |BB |CC |
+---+---+---+
|a1 |b1 |c1 |
|a1 |b1 |c2 |
|a1 |b2 |c4 |
|a1 |b2 |c3 |
|a2 |b3 |c1 |
|a2 |b3 |c4 |
+---+---+---+
