
Convert Python dictionary into PySpark DataFrame

I have a JSON file which contains a dictionary in the following format:

{"a1":{"b1":["c1","c2"], "b2":["c4","c3"]}, "a2":{"b3":["c1","c4"]}}

Is it possible to convert this dictionary into a PySpark DataFrame like the following?

+------+------+------+
| col1 | col2 | col3 |
+------+------+------+
| a1   | b1   | c1   |
| a1   | b1   | c2   |
| a1   | b2   | c4   |
| a1   | b2   | c3   |
| a2   | b3   | c1   |
| a2   | b3   | c4   |
+------+------+------+

I have seen the standard way of converting JSON to a PySpark DataFrame (example in this link), but was wondering about nested dictionaries that contain lists as well.
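For reference, the target row set can be sketched in plain Python (no Spark) with nested loops over the dictionary; this only illustrates the flattening the question asks for, not a Spark solution:

```python
import json

raw = '{"a1":{"b1":["c1","c2"], "b2":["c4","c3"]}, "a2":{"b3":["c1","c4"]}}'
data = json.loads(raw)

# Flatten the nested dict into (col1, col2, col3) tuples:
# outer key, inner key, then one row per list element.
rows = [(k1, k2, v)
        for k1, inner in data.items()
        for k2, values in inner.items()
        for v in values]
print(rows)
# [('a1', 'b1', 'c1'), ('a1', 'b1', 'c2'), ('a1', 'b2', 'c4'),
#  ('a1', 'b2', 'c3'), ('a2', 'b3', 'c1'), ('a2', 'b3', 'c4')]
```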

Interesting problem. The main difficulty I ran into is that when you read this from JSON, the inferred schema is likely a struct type, which makes the problem harder to solve: `a1` effectively has a different type than `a2`.
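You can see the shape mismatch directly in plain Python: the inner dicts under `a1` and `a2` have different key sets, so Spark's schema inference produces two distinct struct types rather than one uniform map type:

```python
import json

raw = '{"a1":{"b1":["c1","c2"], "b2":["c4","c3"]}, "a2":{"b3":["c1","c4"]}}'
data = json.loads(raw)

# Different field sets => no single struct type describes both values.
print(set(data["a1"]))  # {'b1', 'b2'}
print(set(data["a2"]))  # {'b3'}
```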

My idea is to somehow convert the struct type to a map type, stack the columns together, then apply a few `explode`s:

This is your df:
+----------------------------------+
|data                              |
+----------------------------------+
|{{[c1, c2], [c4, c3]}, {[c1, c4]}}|
+----------------------------------+

root
 |-- data: struct (nullable = true)
 |    |-- a1: struct (nullable = true)
 |    |    |-- b1: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- b2: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |-- a2: struct (nullable = true)
 |    |    |-- b3: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
Create a temporary df to handle the JSON's first level:
first_level_df = df.select('data.*')
first_level_df.show()
first_level_cols = first_level_df.columns # ['a1', 'a2']

+--------------------+----------+
|                  a1|        a2|
+--------------------+----------+
|{[c1, c2], [c4, c3]}|{[c1, c4]}|
+--------------------+----------+
Some helper variables (with the imports the snippets below rely on):
from pyspark.sql import functions as F
from pyspark.sql import types as T

map_cols = [F.from_json(F.to_json(c), T.MapType(T.StringType(), T.StringType())).alias(c) for c in first_level_cols]
# [Column<'entries AS a1'>, Column<'entries AS a2'>]

stack_cols = ', '.join([f"'{c}', {c}" for c in first_level_cols])
# 'a1', a1, 'a2', a2
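The `stack_cols` string is plain Python string building, so you can check it without Spark. It becomes the argument list of Spark SQL's `stack()` function, where each column name appears once quoted (the label) and once bare (the value), turning the two columns into two rows of (label, value) pairs:

```python
first_level_cols = ['a1', 'a2']

# Build the argument list for Spark SQL's stack() expression.
stack_cols = ', '.join([f"'{c}', {c}" for c in first_level_cols])
print(stack_cols)  # 'a1', a1, 'a2', a2
```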
Main transformation:
(first_level_df
    .select(map_cols)
    .select(F.expr(f'stack(2, {stack_cols})').alias('AA', 'temp'))
    .select('AA', F.explode('temp').alias('BB', 'temp'))
    .select('AA', 'BB', F.explode(F.from_json('temp', T.ArrayType(T.StringType()))).alias('CC'))
    .show(10, False)
)

+---+---+---+
|AA |BB |CC |
+---+---+---+
|a1 |b1 |c1 |
|a1 |b1 |c2 |
|a1 |b2 |c4 |
|a1 |b2 |c3 |
|a2 |b3 |c1 |
|a2 |b3 |c4 |
+---+---+---+
