
pyspark dataframe pivot a json column to new columns

I would like to extract data from a json column in a pyspark dataframe using python3.

My dataframe:

  year month p_name json_col 
  2010 05    rchsc  [{"attri_name": "in_market", "value": "yes"}, {"attri_name": "weight", "value": "12.56"}, {"attri_name" : "color", "value" : "red"} ]

I need a dataframe like:

 year month p_name in_market weight color 
 2010 05    rchsc  yes       12.56  red

I have tried:

 from pyspark.sql.functions import from_json, col
 from pyspark.sql.types import StructType, StructField, StringType

 schema = StructType(
   [
     StructField('attri_name', StringType(), True),
     StructField('value', StringType(), True)
   ]
 )
 df.withColumn("new_col", from_json("json_col", schema))

But no new columns are created. I am not sure how to decompose the json column and pivot the attributes into new columns.
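The target reshape itself is just turning attribute/value pairs into keys. Outside Spark, a plain-Python sketch of the same pivot on the one example row (values taken from the dataframe above) looks like this:

```python
import json

# One row of the example data: the json column holds an array of
# {"attri_name": ..., "value": ...} structs.
row = {
    "year": "2010", "month": "05", "p_name": "rchsc",
    "json_col": ('[{"attri_name": "in_market", "value": "yes"},'
                 ' {"attri_name": "weight", "value": "12.56"},'
                 ' {"attri_name": "color", "value": "red"}]'),
}

# Keep the non-json columns, then parse the array and promote each
# attri_name to its own key.
pivoted = {k: v for k, v in row.items() if k != "json_col"}
pivoted.update({d["attri_name"]: d["value"] for d in json.loads(row["json_col"])})

print(pivoted)
# {'year': '2010', 'month': '05', 'p_name': 'rchsc',
#  'in_market': 'yes', 'weight': '12.56', 'color': 'red'}
```

The answers below do the same thing at scale, where Spark needs an explicit schema for the array and a pivot to spread the keys across columns.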

Define the schema with ArrayType, since you have an array in the json column, then explode the array and pivot on attri_name.

Example:

df.show()
#+----+-----+------+------------------------------------------------------------------------------------------------------------------------------------+
#|year|month|p_name|json_col                                                                                                                            |
#+----+-----+------+------------------------------------------------------------------------------------------------------------------------------------+
#|2010|05   |rchsc |[{"attri_name": "in_market", "value": "yes"}, {"attri_name": "weight", "value": "12.56"}, {"attri_name" : "color", "value" : "red"}]|
#+----+-----+------+------------------------------------------------------------------------------------------------------------------------------------+
from pyspark.sql.types import *
from pyspark.sql.functions import *

schema = ArrayType(StructType(
   [
     StructField('attri_name', StringType(), True),
     StructField('value', StringType(), True)
   ]
 ))

df.withColumn("ff", from_json(col("json_col"), schema)).\
    selectExpr("*", "explode(ff)").\
    select("*", "col.*").\
    drop("json_col", "ff", "col").\
    groupBy("year", "month", "p_name").\
    pivot("attri_name").\
    agg(first(col("value"))).\
    show()
#+----+-----+------+-----+---------+------+
#|year|month|p_name|color|in_market|weight|
#+----+-----+------+-----+---------+------+
#|2010|   05| rchsc|  red|      yes| 12.56|
#+----+-----+------+-----+---------+------+

Check this out. You can define a schema upfront that matches the input data, use explode to flatten the array, then pivot and grab the elements from the struct to make new columns.

        from pyspark.sql import SparkSession
        from pyspark.sql import functions as F
        from pyspark.sql.types import StructType,StructField,StringType,IntegerType,ArrayType

        spark = SparkSession.builder \
            .appName('SO')\
            .getOrCreate()

        schema = StructType([
          StructField("year", IntegerType(), True),
          StructField("month", IntegerType(),  True),
          StructField("p_name", StringType(), True),
          StructField("json_col", ArrayType(StructType([StructField("attri_name", StringType(), True),
                                                        StructField("value", StringType(), True)])))

        ])

        data = [(2010, 5, "rchsc", [{"attri_name": "in_market", "value": "yes"}, {"attri_name": "weight", "value": "12.56"}, {"attri_name" : "color", "value" : "red"}])]

        df = spark.createDataFrame(data,schema)

        df.show(truncate=False)

        # +----+-----+------+-------------------------------------------------+
        # |year|month|p_name|json_col                                         |
        # +----+-----+------+-------------------------------------------------+
        # |2010|5    |rchsc |[[in_market, yes], [weight, 12.56], [color, red]]|
        # +----+-----+------+-------------------------------------------------+



        df1 = df.select("year","month", "p_name", F.explode("json_col"))

        df2 = df1.groupBy("year", "month", "p_name").pivot("col.attri_name").agg(F.first("col.value"))

        df2.show()

        # +----+-----+------+-----+---------+------+
        # |year|month|p_name|color|in_market|weight|
        # +----+-----+------+-----+---------+------+
        # |2010|    5| rchsc|  red|      yes| 12.56|
        # +----+-----+------+-----+---------+------+
