pyspark dataframe pivot a json column to new columns
I want to extract data from a JSON column in a PySpark dataframe using Python 3.
My dataframe:
year month p_name json_col
2010 05 rchsc [{"attri_name": "in_market", "value": "yes"}, {"attri_name": "weight", "value": "12.56"}, {"attri_name" : "color", "value" : "red"} ]
I need a dataframe like:
year month p_name in_market weight color
2010 05 rchsc yes 12.56 red
I have tried:
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType(
    [
        StructField('attri_name', StringType(), True),
        StructField('value', StringType(), True)
    ]
)

df.withColumn("new_col", from_json("json_col", schema))
However, no new column is created. I am not sure how to explode the JSON column and pivot it into new columns.
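To see why the attempt fails, here is the shape of the data in plain Python, outside Spark (a sketch for intuition, not Spark code): the string in `json_col` parses to a JSON *array* of objects, so a schema describing a single struct does not match it, which is why the answers below wrap the schema in `ArrayType`.

```python
import json

# The string stored in json_col. Note the outermost [ ]: it is a JSON
# array of objects, not a single object.
json_col = ('[{"attri_name": "in_market", "value": "yes"}, '
            '{"attri_name": "weight", "value": "12.56"}, '
            '{"attri_name": "color", "value": "red"}]')

parsed = json.loads(json_col)
print(type(parsed).__name__)  # list

# The explode-then-pivot done in Spark amounts to turning that list
# into one attribute -> value mapping per row:
pivoted = {d["attri_name"]: d["value"] for d in parsed}
print(pivoted)  # {'in_market': 'yes', 'weight': '12.56', 'color': 'red'}
```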
Define the schema with ArrayType, since json_col holds a JSON array, then explode and pivot.
Example:
df.show()
#+----+-----+------+------------------------------------------------------------------------------------------------------------------------------------+
#|year|month|p_name|json_col |
#+----+-----+------+------------------------------------------------------------------------------------------------------------------------------------+
#|2010|05 |rchsc |[{"attri_name": "in_market", "value": "yes"}, {"attri_name": "weight", "value": "12.56"}, {"attri_name" : "color", "value" : "red"}]|
#+----+-----+------+------------------------------------------------------------------------------------------------------------------------------------+
from pyspark.sql.types import *
from pyspark.sql.functions import *
schema = ArrayType(StructType(
    [
        StructField('attri_name', StringType(), True),
        StructField('value', StringType(), True)
    ]
))

df.withColumn("ff", from_json(col("json_col"), schema)).\
    selectExpr("*", "explode(ff)").\
    select("*", "col.*").\
    drop(*["json_col", "ff", "col"]).\
    groupBy("year", "month", "p_name").\
    pivot("attri_name").\
    agg(first(col("value"))).\
    show()
#+----+-----+------+-----+---------+------+
#|year|month|p_name|color|in_market|weight|
#+----+-----+------+-----+---------+------+
#|2010| 05| rchsc| red| yes| 12.56|
#+----+-----+------+-----+---------+------+
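What the final `groupBy(...).pivot(...).agg(first(...))` step computes can be modeled in plain Python (a sketch; the tuples below mirror the exploded rows produced by `selectExpr("*", "explode(ff)")` above):

```python
from collections import defaultdict

# Exploded rows in the shape (year, month, p_name, attri_name, value):
rows = [
    (2010, "05", "rchsc", "in_market", "yes"),
    (2010, "05", "rchsc", "weight", "12.56"),
    (2010, "05", "rchsc", "color", "red"),
]

# groupBy("year", "month", "p_name") keys the result on those columns;
# pivot("attri_name") turns each distinct attri_name into a column;
# agg(first("value")) keeps the first value seen per group and column.
pivoted = defaultdict(dict)
for year, month, p_name, attr, value in rows:
    pivoted[(year, month, p_name)].setdefault(attr, value)

print(dict(pivoted))
```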
Check this out. You can define the schema for the input data up front, use explode to flatten the array, then pivot and pull the elements out of the struct to create the new columns.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType,StructField,StringType,IntegerType,ArrayType
spark = SparkSession.builder \
    .appName('SO') \
    .getOrCreate()
schema = StructType([
    StructField("year", IntegerType(), True),
    StructField("month", IntegerType(), True),
    StructField("p_name", StringType(), True),
    StructField("json_col", ArrayType(StructType([
        StructField("attri_name", StringType(), True),
        StructField("value", StringType(), True)
    ])))
])
data = [(2010, 5, "rchsc", [{"attri_name": "in_market", "value": "yes"}, {"attri_name": "weight", "value": "12.56"}, {"attri_name" : "color", "value" : "red"}])]
df = spark.createDataFrame(data,schema)
df.show(truncate=False)
# +----+-----+------+-------------------------------------------------+
# |year|month|p_name|json_col |
# +----+-----+------+-------------------------------------------------+
# |2010|5 |rchsc |[[in_market, yes], [weight, 12.56], [color, red]]|
# +----+-----+------+-------------------------------------------------+
df1 = df.select("year","month", "p_name", F.explode("json_col"))
df2 = df1.groupBy("year", "month", "p_name").pivot("col.attri_name").agg(F.first("col.value"))
df2.show()
# +----+-----+------+-----+---------+------+
# |year|month|p_name|color|in_market|weight|
# +----+-----+------+-----+---------+------+
# |2010| 5| rchsc| red| yes| 12.56|
# +----+-----+------+-----+---------+------+
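As a side note (not part of either answer): on Spark 2.4+ the explode/groupBy/pivot round trip can be avoided by turning the array of structs into a map with `pyspark.sql.functions.map_from_entries` and selecting keys from it, e.g. `df.withColumn("m", F.map_from_entries("json_col")).select("year", "month", "p_name", F.col("m")["in_market"].alias("in_market"), ...)`. The plain-Python equivalent of that map lookup, for one row:

```python
# One row's json_col as (attri_name, value) pairs -- the shape
# map_from_entries consumes.
entries = [("in_market", "yes"), ("weight", "12.56"), ("color", "red")]

m = dict(entries)  # corresponds to map_from_entries(col("json_col"))

# Selecting m["attr"] per attribute replaces the explode/pivot round trip:
row = {
    "year": 2010, "month": 5, "p_name": "rchsc",
    "in_market": m.get("in_market"),
    "weight": m.get("weight"),
    "color": m.get("color"),
}
print(row)
```

This only works when every attribute appears at most once per row; with duplicates, the answers' `pivot` + `first` approach makes the tie-breaking explicit.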