
pyspark dataframe pivot a json column to new columns

I want to extract data from a json column in a pyspark dataframe using Python 3.

My dataframe:

  year month p_name json_col 
  2010 05    rchsc  [{"attri_name": "in_market", "value": "yes"}, {"attri_name": "weight", "value": "12.56"}, {"attri_name" : "color", "value" : "red"} ]

I need a dataframe like:

 year month p_name in_market weight color 
 2010 05    rchsc  yes       12.56  red

I have tried:

 from pyspark.sql.functions import from_json, col
 from pyspark.sql.types import StructType, StructField, StringType

 schema = StructType(
   [
     StructField('attri_name', StringType(), True),
     StructField('value', StringType(), True)
   ]
 )
 df.withColumn("new_col", from_json("json_col", schema))

However, no new columns are created. I am not sure how to explode the json column and pivot it into new columns.
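As a sanity check on the data shape, the transformation being asked for — turning a list of {attri_name, value} pairs into columns — can be modeled in plain Python with the standard json module (a rough sketch of the intended result, not Spark code):

```python
import json

# one row of the example dataframe; json_col is assumed to be a JSON string
row = {"year": 2010, "month": "05", "p_name": "rchsc",
       "json_col": ('[{"attri_name": "in_market", "value": "yes"}, '
                    '{"attri_name": "weight", "value": "12.56"}, '
                    '{"attri_name": "color", "value": "red"}]')}

# pivot: each attri_name becomes its own column holding the matching value
attrs = {d["attri_name"]: d["value"] for d in json.loads(row["json_col"])}
flat = {k: v for k, v in row.items() if k != "json_col"}
flat.update(attrs)
print(flat)
# {'year': 2010, 'month': '05', 'p_name': 'rchsc', 'in_market': 'yes', 'weight': '12.56', 'color': 'red'}
```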

Define the schema with ArrayType, since you have an array in the json, then explode and pivot:

Example:

df.show()
#+----+-----+------+------------------------------------------------------------------------------------------------------------------------------------+
#|year|month|p_name|json_col                                                                                                                            |
#+----+-----+------+------------------------------------------------------------------------------------------------------------------------------------+
#|2010|05   |rchsc |[{"attri_name": "in_market", "value": "yes"}, {"attri_name": "weight", "value": "12.56"}, {"attri_name" : "color", "value" : "red"}]|
#+----+-----+------+------------------------------------------------------------------------------------------------------------------------------------+
from pyspark.sql.types import *
from pyspark.sql.functions import *

schema = ArrayType(StructType(
    [
      StructField('attri_name', StringType(), True),
      StructField('value', StringType(), True)
    ]
))

df.withColumn("ff", from_json(col("json_col"), schema)) \
  .selectExpr("*", "explode(ff)") \
  .select("*", "col.*") \
  .drop("json_col", "ff", "col") \
  .groupBy("year", "month", "p_name") \
  .pivot("attri_name") \
  .agg(first(col("value"))) \
  .show()
#+----+-----+------+-----+---------+------+
#|year|month|p_name|color|in_market|weight|
#+----+-----+------+-----+---------+------+
#|2010|   05| rchsc|  red|      yes| 12.56|
#+----+-----+------+-----+---------+------+

Check this out. You can define the schema up front for the input data, use explode to blow up the array, then pivot and pull the elements out of the struct to create new columns.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType

spark = SparkSession.builder \
    .appName('SO') \
    .getOrCreate()

schema = StructType([
    StructField("year", IntegerType(), True),
    StructField("month", IntegerType(), True),
    StructField("p_name", StringType(), True),
    StructField("json_col", ArrayType(StructType([
        StructField("attri_name", StringType(), True),
        StructField("value", StringType(), True)])))
])

data = [(2010, 5, "rchsc", [{"attri_name": "in_market", "value": "yes"},
                            {"attri_name": "weight", "value": "12.56"},
                            {"attri_name": "color", "value": "red"}])]

df = spark.createDataFrame(data, schema)

df.show(truncate=False)

# +----+-----+------+-------------------------------------------------+
# |year|month|p_name|json_col                                         |
# +----+-----+------+-------------------------------------------------+
# |2010|5    |rchsc |[[in_market, yes], [weight, 12.56], [color, red]]|
# +----+-----+------+-------------------------------------------------+

# explode the array so each attribute struct becomes its own row
df1 = df.select("year", "month", "p_name", F.explode("json_col"))

# pivot attribute names into columns, keeping the first value per group
df2 = df1.groupBy("year", "month", "p_name").pivot("col.attri_name").agg(F.first("col.value"))

df2.show()

# +----+-----+------+-----+---------+------+
# |year|month|p_name|color|in_market|weight|
# +----+-----+------+-----+---------+------+
# |2010|    5| rchsc|  red|      yes| 12.56|
# +----+-----+------+-----+---------+------+
