[英]Transform nested dictionary key values to pyspark dataframe
from pyspark.sql import functions as F
df.show() #sample dataframe
+---------+----------------------------------------------------------------------------------------------------------+
|timestmap|dic |
+---------+----------------------------------------------------------------------------------------------------------+
|timestamp|{"Name":"David","Age":"25","Location":"New York","Height":"170","fields":{"Color":"Blue","Shape":"round"}}|
+---------+----------------------------------------------------------------------------------------------------------+
對於Spark2.4+
,您可以使用from_json
和schema_of_json
。
schema=df.select(F.schema_of_json(df.select("dic").first()[0])).first()[0]
df.withColumn("dic", F.from_json("dic", schema))\
.selectExpr("dic.*").selectExpr("*","fields.*").drop("fields").show()
#+---+------+--------+-----+-----+-----+
#|Age|Height|Location| Name|Color|Shape|
#+---+------+--------+-----+-----+-----+
#| 25| 170|New York|David| Blue|round|
#+---+------+--------+-----+-----+-----+
如果您沒有spark2.4
,您也可以將rdd
方式與read.json
一起使用。 df to rdd
的轉換會對性能造成影響。
df1 = spark.read.json(df.rdd.map(lambda r: r.dic))\
df1.select(*[x for x in df1.columns if x!='fields'], F.col("fields.*")).show()
#+---+------+--------+-----+-----+-----+
#|Age|Height|Location| Name|Color|Shape|
#+---+------+--------+-----+-----+-----+
#| 25| 170|New York|David| Blue|round|
#+---+------+--------+-----+-----+-----+
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.