
Convert multiple list columns to json array column in dataframe in pyspark

I have a dataframe with multiple list columns that need to be converted into a single JSON array column.

I'm using the logic below, but with no luck. Any ideas?

def test(test1,test2):
    d = {'data': [{'marks': a, 'grades': t} for a, t in zip(test1, test2)]}
    return d

I defined the UDF with an array type as below and tried calling it with the columns, but it didn't work. Any ideas?

arrayToMapUDF = udf(test, ArrayType(StringType()))

df.withColumn("jsonarraycolumn", arrayToMapUDF(col("col"), col("col2")))
marks                      grades
[100, 150, 200, 300, 400]  [0.01, 0.02, 0.03, 0.04, 0.05]

It needs to be converted as shown below.

marks                      grades                          Json array column
[100, 150, 200, 300, 400]  [0.01, 0.02, 0.03, 0.04, 0.05]  {attributes: [{marks: 1000, grades: 0.01},
                                                                         {marks: 15000, grades: 0.02},
                                                                         {marks: 2000, grades: 0.03}]}
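
For reference, the answers below can be reproduced with a DataFrame like the following (a minimal sketch; the column names amount and discount follow the answer's examples rather than the question's marks and grades, and the SparkSession setup is assumed):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# One row with two list columns, matching the sample outputs shown below
df = spark.createDataFrame(
    [([1000, 15000, 2000, 3000, 4000], [0.01, 0.02, 0.03, 0.04, 0.05])],
    ["amount", "discount"],
)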

You can use StringType, because what the function returns is a JSON string, not an array of strings. You can also use json.dumps to convert the dictionary into a JSON string.

import pyspark.sql.functions as F
from pyspark.sql.types import StringType
import json

def test(test1, test2):
    # Pair up the two lists element-wise and serialize to a JSON string
    d = [{'amount': a, 'discount': t} for a, t in zip(test1, test2)]
    return json.dumps(d)

arrayToMapUDF = F.udf(test, StringType())

df2 = df.withColumn("jsonarraycolumn", arrayToMapUDF(F.col("amount"), F.col("discount")))

df2.show(truncate=False)
+-------------------------------+------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|amount                         |discount                      |jsonarraycolumn                                                                                                                                                                      |
+-------------------------------+------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[1000, 15000, 2000, 3000, 4000]|[0.01, 0.02, 0.03, 0.04, 0.05]|[{"amount": 1000, "discount": 0.01}, {"amount": 15000, "discount": 0.02}, {"amount": 2000, "discount": 0.03}, {"amount": 3000, "discount": 0.04}, {"amount": 4000, "discount": 0.05}]|
+-------------------------------+------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
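
Because jsonarraycolumn is a plain JSON string, it can be parsed back into structured form with from_json when needed (a minimal sketch; the schema is an assumption inferred from the output above):

from pyspark.sql.types import ArrayType, StructType, StructField, IntegerType, DoubleType

# Schema of the serialized records, assumed from the sample data
json_schema = ArrayType(StructType([
    StructField('amount', IntegerType()),
    StructField('discount', DoubleType())
]))

df3 = df2.withColumn('parsed', F.from_json('jsonarraycolumn', json_schema))
df3.select(F.col('parsed')[0]['amount']).show()  # amount of the first element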

If you don't want the quotes:

import pyspark.sql.functions as F
from pyspark.sql.types import StringType
import json

def test(test1, test2):
    d = [{'amount': a, 'discount': t} for a, t in zip(test1, test2)]
    # Strip the double quotes from the serialized JSON (display only)
    return json.dumps(d).replace('"', '')

arrayToMapUDF = F.udf(test, StringType())

df2 = df.withColumn("jsonarraycolumn", arrayToMapUDF(F.col("amount"), F.col("discount")))

df2.show(truncate=False)
+-------------------------------+------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|amount                         |discount                      |jsonarraycolumn                                                                                                                                                  |
+-------------------------------+------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[1000, 15000, 2000, 3000, 4000]|[0.01, 0.02, 0.03, 0.04, 0.05]|[{amount: 1000, discount: 0.01}, {amount: 15000, discount: 0.02}, {amount: 2000, discount: 0.03}, {amount: 3000, discount: 0.04}, {amount: 4000, discount: 0.05}]|
+-------------------------------+------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
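
Note that stripping the quotes this way produces a string that is no longer valid JSON, so it cannot be parsed back with json.loads or from_json; treat it purely as display formatting.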

If you want a column of a true JSON (struct) type:

from pyspark.sql.types import ArrayType, StructType, StructField, StringType

def test(test1, test2):
    # Return Python dicts; Spark maps them onto the declared struct schema
    d = [{'amount': a, 'discount': t} for a, t in zip(test1, test2)]
    return d

arrayToMapUDF = F.udf(test,
    ArrayType(
        StructType([
            StructField('amount', StringType()),
            StructField('discount', StringType())
        ])
    )
)

df2 = df.withColumn("jsonarraycolumn", arrayToMapUDF(F.col("amount"), F.col("discount")))

df2.show(truncate=False)
+-------------------------------+------------------------------+-----------------------------------------------------------------------+
|amount                         |discount                      |jsonarraycolumn                                                        |
+-------------------------------+------------------------------+-----------------------------------------------------------------------+
|[1000, 15000, 2000, 3000, 4000]|[0.01, 0.02, 0.03, 0.04, 0.05]|[[1000, 0.01], [15000, 0.02], [2000, 0.03], [3000, 0.04], [4000, 0.05]]|
+-------------------------------+------------------------------+-----------------------------------------------------------------------+

df2.printSchema()
root
 |-- amount: array (nullable = false)
 |    |-- element: integer (containsNull = false)
 |-- discount: array (nullable = false)
 |    |-- element: double (containsNull = false)
 |-- jsonarraycolumn: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- amount: string (nullable = true)
 |    |    |-- discount: string (nullable = true)
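
If you later need the JSON text from this struct column, to_json can serialize it without another udf (a short sketch building on the df2 defined above):

df3 = df2.withColumn('jsonstring', F.to_json('jsonarraycolumn'))
df3.show(truncate=False)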

To avoid using a udf, you can use higher-order functions:

import pyspark.sql.functions as f

transform_expr = "TRANSFORM(arrays_zip(amount, discount), value -> value)"
df = df.withColumn('jsonarraycolumn', f.to_json(f.expr(transform_expr)))

df.show(truncate=False)

Output:

+-------------------------------+------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|amount                         |discount                      |jsonarraycolumn                                                                                                                                                             |
+-------------------------------+------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[1000, 15000, 2000, 3000, 4000]|[0.01, 0.02, 0.03, 0.04, 0.05]|[{"amount":1000.0,"discount":0.01},{"amount":15000.0,"discount":0.02},{"amount":2000.0,"discount":0.03},{"amount":3000.0,"discount":0.04},{"amount":4000.0,"discount":0.05}]|
+-------------------------------+------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
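
On Spark 2.4 and later, the same result can be written without the SQL expression string, since arrays_zip is also exposed as a column function and the identity TRANSFORM above is effectively a pass-through (a minimal sketch; the struct field names should match the input column names here):

import pyspark.sql.functions as f

df = df.withColumn(
    'jsonarraycolumn',
    f.to_json(f.arrays_zip(f.col('amount'), f.col('discount')))
)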
