
flatten nested json scala code in pyspark

Trying to do the equivalent of the following Scala code, but in PySpark:

val maxJsonParts = 3 // whatever that number is...
val jsonElements = (0 until maxJsonParts)
                     .map(i => get_json_object($"Payment", s"$$[$i]"))

val newDF = dataframe
  .withColumn("Payment", explode(array(jsonElements: _*)))
  .where(!isnull($"Payment"))

For example, I am trying to take a nested column, such as the payment column below:

id | name  | payment
1  | James | [ {"@id": 1, "currency":"GBP"},{"@id": 2, "currency": "USD"} ]

to become:

id | name  | payment
1  | James | {"@id": 1, "currency":"GBP"}
1  | James | {"@id": 2, "currency":"USD"}

The table schema:

root
|-- id: integer (nullable = true)
|-- Name: string (nullable = true)   
|-- Payment: string (nullable = true)

I tried writing this in PySpark, but it just turns the nested column (payment) to null:

lst = [range(0,10)]
jsonElem = [F.get_json_object(F.col("payment"), f"$[{i}]") for i in lst]
bronzeDF = bronzeDF.withColumn("payment2", F.explode(F.array(*jsonElem)))
bronzeDF.show()
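The null rows here likely come from a small Python bug rather than from Spark: `[range(0,10)]` is a one-element list containing a range object, not a list of indices, so the comprehension builds a single, malformed JSON path. A minimal stdlib sketch of the difference (no Spark session needed):

```python
# lst is a one-element list whose single element is a range object,
# so the comprehension builds exactly one path, interpolating
# str(range(0, 10)) -- not a valid JSON path, hence nulls.
lst = [range(0, 10)]
paths = [f"$[{i}]" for i in lst]
print(paths)  # ['$[range(0, 10)]']

# What was intended: one "$[i]" path per index.
lst = list(range(10))
paths = [f"$[{i}]" for i in lst]
print(paths[:3])  # ['$[0]', '$[1]', '$[2]']
```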

Any help is highly appreciated.

Here is another approach, which parses the given JSON against the right schema in order to generate the payment array. The solution is based on the from_json function, which parses a JSON string into a struct type.

from pyspark.sql.types import IntegerType, StringType, ArrayType, StructType, StructField
from pyspark.sql.functions import from_json, explode

data = [
  (1, 'James', '[ {"@id": 1, "currency":"GBP"},{"@id": 2, "currency": "USD"} ]'), 
  (2, 'Tonny', '[ {"@id": 3, "currency":"EUR"},{"@id": 4, "currency": "USD"} ]'), 
]
df = spark.createDataFrame(data, ['id', 'name', 'payment'])

str_schema = 'array<struct<`@id`:int,`currency`:string>>'

# st_schema = ArrayType(StructType([
#                 StructField('@id', IntegerType()),
#                 StructField('currency', StringType())]))

df = df.withColumn("payment", explode(from_json(df["payment"], str_schema)))

df.show()

# +---+-----+--------+
# | id| name| payment|
# +---+-----+--------+
# |  1|James|[1, GBP]|
# |  1|James|[2, USD]|
# |  2|Tonny|[3, EUR]|
# |  2|Tonny|[4, USD]|
# +---+-----+--------+

Note: as you can see, you can choose between the string (DDL) representation of the schema or the ArrayType one. Both should produce the same result.
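For a quick sanity check of what `from_json` plus `explode` should produce, the same reshaping can be mimicked without a Spark session using the stdlib `json` module (a sketch only; Spark's actual parsing, typing, and null handling are richer):

```python
import json

rows = [
    (1, 'James', '[ {"@id": 1, "currency":"GBP"},{"@id": 2, "currency": "USD"} ]'),
    (2, 'Tonny', '[ {"@id": 3, "currency":"EUR"},{"@id": 4, "currency": "USD"} ]'),
]

# Mimic from_json + explode: parse each payment string as a JSON array
# and emit one output row per array element.
exploded = [
    (rid, name, elem)
    for rid, name, payload in rows
    for elem in json.loads(payload)
]

for r in exploded:
    print(r)
# (1, 'James', {'@id': 1, 'currency': 'GBP'})
# (1, 'James', {'@id': 2, 'currency': 'USD'})
# (2, 'Tonny', {'@id': 3, 'currency': 'EUR'})
# (2, 'Tonny', {'@id': 4, 'currency': 'USD'})
```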

I came to the solution:

First, convert the column to a string type as follows:

bronzeDF = bronzeDF.withColumn("payment2", F.to_json("payment")).drop("payment")

Then you can run the following code on the column to stack the n nested JSON objects as separate rows with the same outer key values:

max_json_parts = 50
lst = list(range(max_json_parts))
jsonElem = [F.get_json_object(F.col("payment2"), f"$[{i}]") for i in lst]
bronzeDF = bronzeDF.withColumn("payment2", F.explode(F.array(*jsonElem))).where(F.col("payment2").isNotNull())
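The probe-and-filter logic above relies on `get_json_object` returning null for out-of-range array indices, which the `isNotNull` filter then drops. A hypothetical stdlib stand-in (not Spark's implementation) shows the shape of that behavior:

```python
import json

def get_json_object(payload, idx):
    # Hypothetical stand-in for F.get_json_object(col, f"$[{idx}]"):
    # returns the element as a JSON string, or None when idx is out of range.
    arr = json.loads(payload)
    return json.dumps(arr[idx]) if idx < len(arr) else None

payment2 = '[{"@id": 1, "currency": "GBP"}, {"@id": 2, "currency": "USD"}]'
max_json_parts = 50

# Probe indices 0..49; only two elements exist, the rest come back None.
parts = [get_json_object(payment2, i) for i in range(max_json_parts)]
rows = [p for p in parts if p is not None]  # mirrors .where(F.col(...).isNotNull())

print(len(rows))  # 2
print(rows[0])    # {"@id": 1, "currency": "GBP"}
```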

Then convert back to a struct with a defined schema, and expand the object keys into separate columns:

# jsonSchemaPayment: the schema describing one payment object (defined elsewhere)
bronzeDF = bronzeDF.withColumn("temp", F.from_json("payment2", jsonSchemaPayment)).select("*", "temp.*").drop("payment2")
