pyspark dataframe pivot a json column to new columns
I would like to extract data from a JSON column in a PySpark dataframe using Python 3.
My dataframe:
year month p_name json_col
2010 05 rchsc [{"attri_name": "in_market", "value": "yes"}, {"attri_name": "weight", "value": "12.56"}, {"attri_name" : "color", "value" : "red"} ]
I need a dataframe like:
year month p_name in_market weight color
2010 05 rchsc yes 12.56 red
I have tried:
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType
schema = StructType(
    [
        StructField('attri_name', StringType(), True),
        StructField('value', StringType(), True)
    ]
)
df.withColumn("new_col", from_json("json_col", schema))
But no new columns are created. I am not sure how to decompose the JSON column and pivot its entries into new columns.
Define the schema with ArrayType, since you have an array in the JSON, then explode and pivot the columns.
Example:
df.show()
#+----+-----+------+------------------------------------------------------------------------------------------------------------------------------------+
#|year|month|p_name|json_col |
#+----+-----+------+------------------------------------------------------------------------------------------------------------------------------------+
#|2010|05 |rchsc |[{"attri_name": "in_market", "value": "yes"}, {"attri_name": "weight", "value": "12.56"}, {"attri_name" : "color", "value" : "red"}]|
#+----+-----+------+------------------------------------------------------------------------------------------------------------------------------------+
from pyspark.sql.types import *
from pyspark.sql.functions import *
schema = ArrayType(StructType(
    [
        StructField('attri_name', StringType(), True),
        StructField('value', StringType(), True)
    ]
))
df.withColumn("ff", from_json(col("json_col"), schema)).\
    selectExpr("*", "explode(ff)").\
    select("*", "col.*").\
    drop(*["json_col", "ff", "col"]).\
    groupBy("year", "month", "p_name").\
    pivot("attri_name").\
    agg(first(col("value"))).\
    show()
#+----+-----+------+-----+---------+------+
#|year|month|p_name|color|in_market|weight|
#+----+-----+------+-----+---------+------+
#|2010| 05| rchsc| red| yes| 12.56|
#+----+-----+------+-----+---------+------+
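For reference, the explode-and-pivot transformation above can be mimicked in plain Python (stdlib `json` only, no Spark) to make the per-row logic explicit. This is just an illustrative sketch; the function name `pivot_json_row` is made up here:

```python
import json

def pivot_json_row(row, json_key="json_col"):
    """Flatten one row: parse the JSON array and turn each
    {"attri_name": ..., "value": ...} entry into a new column."""
    out = {k: v for k, v in row.items() if k != json_key}
    for entry in json.loads(row[json_key]):
        # setdefault mirrors pivot(...).agg(first(...)): the first
        # value seen for an attri_name wins.
        out.setdefault(entry["attri_name"], entry["value"])
    return out

row = {
    "year": "2010", "month": "05", "p_name": "rchsc",
    "json_col": '[{"attri_name": "in_market", "value": "yes"}, '
                '{"attri_name": "weight", "value": "12.56"}, '
                '{"attri_name": "color", "value": "red"}]',
}
print(pivot_json_row(row))
# {'year': '2010', 'month': '05', 'p_name': 'rchsc',
#  'in_market': 'yes', 'weight': '12.56', 'color': 'red'}
```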
Check this out. You can define a schema upfront for the input data, use explode to break up the array, then use pivot and grab the elements from the struct to make new columns.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType,StructField,StringType,IntegerType,ArrayType
spark = SparkSession.builder \
    .appName('SO') \
    .getOrCreate()
schema = StructType([
    StructField("year", IntegerType(), True),
    StructField("month", IntegerType(), True),
    StructField("p_name", StringType(), True),
    StructField("json_col", ArrayType(StructType([
        StructField("attri_name", StringType(), True),
        StructField("value", StringType(), True)
    ])))
])
data = [(2010, 5, "rchsc", [{"attri_name": "in_market", "value": "yes"}, {"attri_name": "weight", "value": "12.56"}, {"attri_name" : "color", "value" : "red"}])]
df = spark.createDataFrame(data,schema)
df.show(truncate=False)
# +----+-----+------+-------------------------------------------------+
# |year|month|p_name|json_col |
# +----+-----+------+-------------------------------------------------+
# |2010|5 |rchsc |[[in_market, yes], [weight, 12.56], [color, red]]|
# +----+-----+------+-------------------------------------------------+
df1 = df.select("year","month", "p_name", F.explode("json_col"))
df2 = df1.groupBy("year", "month", "p_name").pivot("col.attri_name").agg(F.first("col.value"))
df2.show()
# +----+-----+------+-----+---------+------+
# |year|month|p_name|color|in_market|weight|
# +----+-----+------+-----+---------+------+
# |2010| 5| rchsc| red| yes| 12.56|
# +----+-----+------+-----+---------+------+
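One practical note: when pivot is called without an explicit value list, Spark makes an extra pass over the data to discover the distinct attri_name values (and sorts them, which is why the output columns appear as color, in_market, weight). If you know the attributes ahead of time you can pass them directly, e.g. .pivot("col.attri_name", ["in_market", "weight", "color"]). A stdlib sketch of that discovery step, for illustration only:

```python
import json

def distinct_attri_names(json_rows):
    """Collect the distinct attri_name values across all rows,
    in first-seen order -- the candidate column set for pivot()."""
    seen = {}
    for raw in json_rows:
        for entry in json.loads(raw):
            seen.setdefault(entry["attri_name"], None)
    return list(seen)

rows = [
    '[{"attri_name": "in_market", "value": "yes"}, '
    '{"attri_name": "weight", "value": "12.56"}, '
    '{"attri_name": "color", "value": "red"}]',
]
print(distinct_attri_names(rows))  # ['in_market', 'weight', 'color']
```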