pyspark dataframe pivot a json column to new columns
I want to extract data from a JSON column in a PySpark dataframe using Python 3.
My dataframe:
year month p_name json_col
2010 05 rchsc [{"attri_name": "in_market", "value": "yes"}, {"attri_name": "weight", "value": "12.56"}, {"attri_name" : "color", "value" : "red"} ]
I need a dataframe like:
year month p_name in_market weight color
2010 05 rchsc yes 12.56 red
I have tried:
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType(
    [
        StructField('attri_name', StringType(), True),
        StructField('value', StringType(), True)
    ]
)

df.withColumn("new_col", from_json("json_col", schema))
However, no new column is created. I am not sure how to explode the JSON column and pivot it into new columns.
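To see why the attempt fails, here is the shape of the data in plain Python, outside Spark (a sketch for intuition, not Spark code): the string in `json_col` parses to a JSON *array* of objects, so a schema describing a single struct does not match it, which is why the answers below wrap the schema in `ArrayType`.

```python
import json

# The string stored in json_col. Note the outermost [ ]: it is a JSON
# array of objects, not a single object.
json_col = ('[{"attri_name": "in_market", "value": "yes"}, '
            '{"attri_name": "weight", "value": "12.56"}, '
            '{"attri_name": "color", "value": "red"}]')

parsed = json.loads(json_col)
print(type(parsed).__name__)  # list

# The explode-then-pivot done in Spark amounts to turning that list
# into one attribute -> value mapping per row:
pivoted = {d["attri_name"]: d["value"] for d in parsed}
print(pivoted)  # {'in_market': 'yes', 'weight': '12.56', 'color': 'red'}
```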
Define the schema with ArrayType, since json_col holds a JSON array, then explode and pivot.
Example:
df.show()
#+----+-----+------+------------------------------------------------------------------------------------------------------------------------------------+
#|year|month|p_name|json_col |
#+----+-----+------+------------------------------------------------------------------------------------------------------------------------------------+
#|2010|05 |rchsc |[{"attri_name": "in_market", "value": "yes"}, {"attri_name": "weight", "value": "12.56"}, {"attri_name" : "color", "value" : "red"}]|
#+----+-----+------+------------------------------------------------------------------------------------------------------------------------------------+
from pyspark.sql.types import *
from pyspark.sql.functions import *
schema = ArrayType(StructType(
    [
        StructField('attri_name', StringType(), True),
        StructField('value', StringType(), True)
    ]
))

df.withColumn("ff", from_json(col("json_col"), schema)).\
    selectExpr("*", "explode(ff)").\
    select("*", "col.*").\
    drop(*["json_col", "ff", "col"]).\
    groupBy("year", "month", "p_name").\
    pivot("attri_name").\
    agg(first(col("value"))).\
    show()
#+----+-----+------+-----+---------+------+
#|year|month|p_name|color|in_market|weight|
#+----+-----+------+-----+---------+------+
#|2010| 05| rchsc| red| yes| 12.56|
#+----+-----+------+-----+---------+------+
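What the final `groupBy(...).pivot(...).agg(first(...))` step computes can be modeled in plain Python (a sketch; the tuples below mirror the exploded rows produced by `selectExpr("*", "explode(ff)")` above):

```python
from collections import defaultdict

# Exploded rows in the shape (year, month, p_name, attri_name, value):
rows = [
    (2010, "05", "rchsc", "in_market", "yes"),
    (2010, "05", "rchsc", "weight", "12.56"),
    (2010, "05", "rchsc", "color", "red"),
]

# groupBy("year", "month", "p_name") keys the result on those columns;
# pivot("attri_name") turns each distinct attri_name into a column;
# agg(first("value")) keeps the first value seen per group and column.
pivoted = defaultdict(dict)
for year, month, p_name, attr, value in rows:
    pivoted[(year, month, p_name)].setdefault(attr, value)

print(dict(pivoted))
```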
Check this out. You can define the schema for the input data up front, use explode to flatten the array, then pivot and pull the elements out of the struct to create the new columns.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType,StructField,StringType,IntegerType,ArrayType
spark = SparkSession.builder \
    .appName('SO') \
    .getOrCreate()
schema = StructType([
    StructField("year", IntegerType(), True),
    StructField("month", IntegerType(), True),
    StructField("p_name", StringType(), True),
    StructField("json_col", ArrayType(StructType([
        StructField("attri_name", StringType(), True),
        StructField("value", StringType(), True)
    ])))
])
data = [(2010, 5, "rchsc", [{"attri_name": "in_market", "value": "yes"}, {"attri_name": "weight", "value": "12.56"}, {"attri_name" : "color", "value" : "red"}])]
df = spark.createDataFrame(data,schema)
df.show(truncate=False)
# +----+-----+------+-------------------------------------------------+
# |year|month|p_name|json_col |
# +----+-----+------+-------------------------------------------------+
# |2010|5 |rchsc |[[in_market, yes], [weight, 12.56], [color, red]]|
# +----+-----+------+-------------------------------------------------+
df1 = df.select("year","month", "p_name", F.explode("json_col"))
df2 = df1.groupBy("year", "month", "p_name").pivot("col.attri_name").agg(F.first("col.value"))
df2.show()
# +----+-----+------+-----+---------+------+
# |year|month|p_name|color|in_market|weight|
# +----+-----+------+-----+---------+------+
# |2010| 5| rchsc| red| yes| 12.56|
# +----+-----+------+-----+---------+------+
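As a side note (not part of either answer): on Spark 2.4+ the explode/groupBy/pivot round trip can be avoided by turning the array of structs into a map with `pyspark.sql.functions.map_from_entries` and selecting keys from it, e.g. `df.withColumn("m", F.map_from_entries("json_col")).select("year", "month", "p_name", F.col("m")["in_market"].alias("in_market"), ...)`. The plain-Python equivalent of that map lookup, for one row:

```python
# One row's json_col as (attri_name, value) pairs -- the shape
# map_from_entries consumes.
entries = [("in_market", "yes"), ("weight", "12.56"), ("color", "red")]

m = dict(entries)  # corresponds to map_from_entries(col("json_col"))

# Selecting m["attr"] per attribute replaces the explode/pivot round trip:
row = {
    "year": 2010, "month": 5, "p_name": "rchsc",
    "in_market": m.get("in_market"),
    "weight": m.get("weight"),
    "color": m.get("color"),
}
print(row)
```

This only works when every attribute appears at most once per row; with duplicates, the answers' `pivot` + `first` approach makes the tie-breaking explicit.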