PySpark: Read nested JSON from a String Type Column and create columns

I have a dataframe in PySpark with 3 columns - json, date and object_id:

-----------------------------------------------------------------------------------------
|json                                                              |date      |object_id|
-----------------------------------------------------------------------------------------
|{'a':{'b':0,'c':{'50':0.005,'60':0,'100':0},'d':0.01,'e':0,'f':2}}|2020-08-01|xyz123   |
|{'a':{'m':0,'n':{'50':0.005,'60':0,'100':0},'d':0.01,'e':0,'f':2}}|2020-08-02|xyz123   |
|{'g':{'h':0,'j':{'50':0.005,'80':0,'100':0},'d':0.02}}            |2020-08-03|xyz123   |
-----------------------------------------------------------------------------------------

Now I have a list of variables: [a.c.60, a.n.60, a.d, g.h]. I need to extract only these variables from the json column of the above mentioned dataframe and add them as columns in the dataframe with their respective values.

So in the end, the dataframe should look like:

-------------------------------------------------------------------------------------------------------
|json                                                    |date      |object_id|a.c.60|a.n.60|a.d |g.h |
-------------------------------------------------------------------------------------------------------
|{'a':{'b':0,'c':{'50':0.005,'60':0,'100':0},'d':0.01,...|2020-08-01|xyz123   |0     |null  |0.01|null|
|{'a':{'m':0,'n':{'50':0.005,'60':0,'100':0},'d':0.01,...|2020-08-02|xyz123   |null  |0     |0.01|null|
|{'g':{'h':0,'j':{'50':0.005,'80':0,'100':0},'d':0.02}}  |2020-08-03|xyz123   |null  |null  |0.02|0   |
-------------------------------------------------------------------------------------------------------

Please help me get this result dataframe. The main problem I am facing is that the incoming json data has no fixed structure. The json data can be anything in nested form, but I need to extract only the given four variables. I have achieved this in Pandas by flattening out the json string and then extracting the 4 variables, but in Spark it is getting difficult.
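For context, this is roughly the Pandas approach I used (a simplified sketch; the flatten helper is my own illustration, not exact code):

import json
import pandas as pd

# Recursively flatten a nested dict into dotted keys,
# e.g. {'a': {'d': 0.01}} -> {'a.d': 0.01}
def flatten(d, prefix=''):
    out = {}
    for k, v in d.items():
        key = f'{prefix}.{k}' if prefix else k
        if isinstance(v, dict):
            out.update(flatten(v, key))
        else:
            out[key] = v
    return out

pdf = pd.DataFrame({'json': [
    '{"a":{"b":0,"c":{"50":0.005,"60":0,"100":0},"d":0.01,"e":0,"f":2}}',
    '{"g":{"h":0,"j":{"50":0.005,"80":0,"100":0},"d":0.02}}']})

variables = ['a.c.60', 'a.n.60', 'a.d', 'g.h']
flat = pdf['json'].apply(lambda s: flatten(json.loads(s)))
for v in variables:
    pdf[v] = flat.apply(lambda row: row.get(v))  # missing variables become None/NaN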

There are 2 ways to do it:

  1. Use the get_json_object function, like this:
from pyspark.sql.types import StringType
import pyspark.sql.functions as F

df = spark.createDataFrame(['{"a":{"b":0,"c":{"50":0.005,"60":0,"100":0},"d":0.01,"e":0,"f":2}}',
                            '{"a":{"m":0,"n":{"50":0.005,"60":0,"100":0},"d":0.01,"e":0,"f":2}}',
                            '{"g":{"h":0,"j":{"50":0.005,"80":0,"100":0},"d":0.02}}'],
                           StringType())

df3 = df.select(F.get_json_object(F.col("value"), "$.a.c.60").alias("a_c_60"),
                F.get_json_object(F.col("value"), "$.a.n.60").alias("a_n_60"),
                F.get_json_object(F.col("value"), "$.a.d").alias("a_d"),
                F.get_json_object(F.col("value"), "$.g.h").alias("g_h"))

will give:

>>> df3.show()
+------+------+----+----+
|a_c_60|a_n_60| a_d| g_h|
+------+------+----+----+
|     0|  null|0.01|null|
|  null|     0|0.01|null|
|  null|  null|null|   0|
+------+------+----+----+
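To produce the final dataframe from the question (keeping json, date and object_id), the same paths can be applied to the original dataframe. Note that get_json_object returns strings, so cast the results if you need numeric columns. A sketch, assuming the asker's dataframe is called df_src with the three columns shown above:

import pyspark.sql.functions as F

# df_src is assumed: the question's dataframe with columns json, date, object_id
paths = {'a_c_60': '$.a.c.60', 'a_n_60': '$.a.n.60', 'a_d': '$.a.d', 'g_h': '$.g.h'}
result = df_src
for name, path in paths.items():
    # get_json_object returns a string (or null), hence the cast to double
    result = result.withColumn(name, F.get_json_object('json', path).cast('double'))
result.show(truncate=False)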
  2. Declare the schema explicitly (only the necessary fields), convert the JSON into structs using the from_json function with that schema, and then extract individual values from the structures - this could be more performant than JSON Path:
from pyspark.sql.types import *
import pyspark.sql.functions as F

aSchema = StructType([
    StructField("c", StructType([
        StructField("60", DoubleType(), True)
    ]), True),
    StructField("n", StructType([
        StructField("60", DoubleType(), True)
    ]), True),
    StructField("d", DoubleType(), True),
])
gSchema = StructType([
    StructField("h", DoubleType(), True)
])

schema = StructType([
    StructField("a", aSchema, True),
    StructField("g", gSchema, True)
])

df = spark.createDataFrame(['{"a":{"b":0,"c":{"50":0.005,"60":0,"100":0},"d":0.01,"e":0,"f":2}}',
                            '{"a":{"m":0,"n":{"50":0.005,"60":0,"100":0},"d":0.01,"e":0,"f":2}}',
                            '{"g":{"h":0,"j":{"50":0.005,"80":0,"100":0},"d":0.02}}'],
                           StringType())

df2 = df.select(F.from_json("value", schema=schema).alias('data')).select('data.*')
df2.select(df2.a.c['60'], df2.a.n['60'], df2.a.d, df2.g.h).show()

will give:

+------+------+----+----+
|a.c.60|a.n.60| a.d| g.h|
+------+------+----+----+
|   0.0|  null|0.01|null|
|  null|   0.0|0.01|null|
|  null|  null|null| 0.0|
+------+------+----+----+
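Applied to the full dataframe, the parsed struct can sit alongside the original columns; a sketch reusing the schema defined above (df_src with columns json, date and object_id is again an assumption based on the question):

# Parse the json column once, then pull out the four fields by struct access,
# in the same style as df2.a.c['60'] above.
parsed = df_src.withColumn('data', F.from_json('json', schema))
result = parsed.select('json', 'date', 'object_id',
                       parsed.data.a.c['60'].alias('a_c_60'),
                       parsed.data.a.n['60'].alias('a_n_60'),
                       parsed.data.a.d.alias('a_d'),
                       parsed.data.g.h.alias('g_h'))
result.show(truncate=False)

Missing branches (e.g. a row with no 'a' key at all) simply come back as null, which matches the expected output in the question.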
