How to cast a string to an ArrayType of dictionaries (JSON) in PySpark

Question

I am trying to cast a StringType column to an ArrayType of JSON for a dataframe generated from a CSV file.

Using PySpark on Spark 2.

The CSV file I am dealing with looks like this:

date,attribute2,count,attribute3
2017-09-03,'attribute1_value1',2,'[{"key":"value","key2":2},{"key":"value","key2":2},{"key":"value","key2":2}]'
2017-09-04,'attribute1_value2',2,'[{"key":"value","key2":20},{"key":"value","key2":25},{"key":"value","key2":27}]'

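As shown above, it contains an attribute, "attribute3", stored as a literal string; technically it is a list of dictionaries (JSON) with an exact length of 2. (This is the output of the function distinct.)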
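Because the JSON text in attribute3 itself contains commas, the field only stays in one piece if the single quote is treated as the CSV quote character. A minimal sketch with Python's standard csv module, assuming the file looks exactly like the sample above (the variable names are illustrative):

```python
import csv
import io

# Two sample rows copied from the question; single quotes wrap the text fields.
raw = """date,attribute2,count,attribute3
2017-09-03,'attribute1_value1',2,'[{"key":"value","key2":2},{"key":"value","key2":2},{"key":"value","key2":2}]'
2017-09-04,'attribute1_value2',2,'[{"key":"value","key2":20},{"key":"value","key2":25},{"key":"value","key2":27}]'
"""

# quotechar="'" keeps the commas inside the JSON text from splitting the field.
reader = csv.reader(io.StringIO(raw), quotechar="'")
header = next(reader)
records = [dict(zip(header, row)) for row in reader]

for record in records:
    print(record["date"], record["attribute3"])
```

In PySpark the analogous setting should be the quote option of the CSV reader, for example spark.read.csv(path, header=True, quote="'"), where path is a placeholder for the actual file location.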

Snippet from printSchema():

attribute3: string (nullable = true)

I am trying to cast "attribute3" to ArrayType as follows:

temp = dataframe.withColumn(
    "attribute3_modified",
    dataframe["attribute3"].cast(ArrayType())
)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: __init__() takes at least 2 arguments (1 given)

Indeed, ArrayType expects a data type as its argument. I tried "json", but it did not work.

Desired output: in the end, I need to convert attribute3 to ArrayType() or to a plain Python list. (I am trying to avoid using eval.)
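For the plain Python list case, one way to avoid eval is the standard json module. This is only a sketch for a single string that is already on the driver, not for a whole DataFrame column; the variable names are illustrative:

```python
import json

# One attribute3 value copied from the sample data.
attribute3 = '[{"key":"value","key2":2},{"key":"value","key2":2},{"key":"value","key2":2}]'

# json.loads only parses JSON; unlike eval it does not execute arbitrary Python code.
items = json.loads(attribute3)

print(type(items).__name__, len(items))
print(items[0]["key"], items[0]["key2"])
```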

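How do I convert it to ArrayType, so that I can treat it as a list of JSON objects?

Am I missing something here?

(The documentation does not address this problem in a straightforward way.)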

Answer 1

Use from_json with a schema that matches the actual data in the attribute3 column to convert the JSON string to ArrayType:

Original data frame:

df.printSchema()
#root
# |-- date: string (nullable = true)
# |-- attribute2: string (nullable = true)
# |-- count: long (nullable = true)
# |-- attribute3: string (nullable = true)

from pyspark.sql.functions import from_json
from pyspark.sql.types import *

Create the schema:

schema = ArrayType(
    StructType([StructField("key", StringType()), 
                StructField("key2", IntegerType())]))

Use from_json:

df = df.withColumn("attribute3", from_json(df.attribute3, schema))

df.printSchema()
#root
# |-- date: string (nullable = true)
# |-- attribute2: string (nullable = true)
# |-- count: long (nullable = true)
# |-- attribute3: array (nullable = true)
# |    |-- element: struct (containsNull = true)
# |    |    |-- key: string (nullable = true)
# |    |    |-- key2: integer (nullable = true)

df.show(1, False)
#+----------+----------+-----+------------------------------------+
#|date      |attribute2|count|attribute3                          |
#+----------+----------+-----+------------------------------------+
#|2017-09-03|attribute1|2    |[[value, 2], [value, 2], [value, 2]]|
#+----------+----------+-----+------------------------------------+

Answer 2

@Psidom's answer did not work for me because I am using Spark 2.1.

In my case, I had to slightly modify your attribute3 string to wrap it in a dictionary:

import pyspark.sql.functions as f
from pyspark.sql.types import ArrayType, IntegerType, StringType, StructField, StructType
df2 = df.withColumn("attribute3", f.concat(f.lit('{"data": '), "attribute3", f.lit("}")))
df2.select("attribute3").show(truncate=False)
#+--------------------------------------------------------------------------------------+
#|attribute3                                                                            |
#+--------------------------------------------------------------------------------------+
#|{"data": [{"key":"value","key2":2},{"key":"value","key2":2},{"key":"value","key2":2}]}|
#+--------------------------------------------------------------------------------------+
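To see what the wrapping does independently of Spark, here is a small sketch using plain string concatenation and the standard json module. It assumes the same sample value as above; the variable names are illustrative:

```python
import json

# One attribute3 value copied from the sample data.
attribute3 = '[{"key":"value","key2":2},{"key":"value","key2":2},{"key":"value","key2":2}]'

# Same concatenation as f.concat(f.lit('{"data": '), "attribute3", f.lit("}")).
wrapped = '{"data": ' + attribute3 + "}"

# The top level is now a JSON object with a single key, "data", holding the list.
document = json.loads(wrapped)
print(list(document.keys()))
print(len(document["data"]))
```

The point of the wrapper is that the top level of the JSON becomes an object rather than an array, so it can be described by a StructType with a single array field.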

Now I can define the schema as follows:

schema = StructType(
    [
        StructField(
            "data",
            ArrayType(
                StructType(
                    [
                        StructField("key", StringType()),
                        StructField("key2", IntegerType())
                    ]
                )
            )
        )
    ]
)

Now use from_json followed by getItem():

df3 = df2.withColumn("attribute3", f.from_json("attribute3", schema).getItem("data"))
df3.show(truncate=False)
#+----------+----------+-----+---------------------------------+
#|date      |attribute2|count|attribute3                       |
#+----------+----------+-----+---------------------------------+
#|2017-09-03|attribute1|2    |[[value,2], [value,2], [value,2]]|
#+----------+----------+-----+---------------------------------+

Schema:

df3.printSchema()
# root
# |-- attribute3: array (nullable = true)
# |    |-- element: struct (containsNull = true)
# |    |    |-- key: string (nullable = true)
# |    |    |-- key2: integer (nullable = true)


Answer 1
3, accepted, 2018-08-06 19:07:16

Answer 2
3, 2018-08-06 19:21:21
