DataFrame - ValueError: Unexpected tuple with StructType

Question

I am trying to create a manual schema for a DataFrame. The data I pass in is an RDD created from JSON. This is my initial data:

json2 = sc.parallelize(['{"name": "mission", "pandas": {"attributes": "[0.4, 0.5]", "pt": "giant", "id": "1", "zip": "94110", "happy": "True"}}'])

And here is how I specify the schema:

schema = StructType(fields=[
    StructField(
        name='name',
        dataType=StringType(),
        nullable=True
    ),
    StructField(
        name='pandas',
        dataType=ArrayType(
            StructType(
                fields=[
                    StructField(
                        name='id',
                        dataType=StringType(),
                        nullable=False
                    ),
                    StructField(
                        name='zip',
                        dataType=StringType(),
                        nullable=True
                    ),
                    StructField(
                        name='pt',
                        dataType=StringType(),
                        nullable=True
                    ),
                    StructField(
                        name='happy',
                        dataType=BooleanType(),
                        nullable=False
                    ),
                    StructField(
                        name='attributes',
                        dataType=ArrayType(
                            elementType=DoubleType(),
                            containsNull=False
                        ),
                        nullable=True

                    )
                ]
            ),
            containsNull=True
        ),
        nullable=True
    )
])

When I call sqlContext.createDataFrame(json2, schema) and then try to run show() on the resulting DataFrame, I get the following error:

ValueError: Unexpected tuple '{"name": "mission", "pandas": {"attributes": "[0.4, 0.5]", "pt": "giant", "id": "1", "zip": "94110", "happy": "True"}}' with StructType

Answer 1

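First of all, json2 is just an RDD[String]. Spark has no special knowledge of the serialization format used to encode the data. Moreover, createDataFrame expects an RDD of Row or some other product type, and a plain string is clearly neither.

In Scala you can use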

sqlContext.read.schema(schema).json(rdd)

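with an RDD[String], but there are two problems:

- This method is not directly accessible in PySpark.
- Even if it were, the schema you created is invalid for this data:
  - pandas is a struct, not an array
  - pandas.happy is a string, not a boolean
  - pandas.attributes is a string, not an array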
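
The type mismatches listed above can be confirmed without Spark at all. The following is a minimal sketch that decodes the record from the question with the standard json module and reports the Python type of each disputed field; the names record, inner and observed are only illustrative.

```python
import json

# The record from the question, exactly as it is stored in the RDD.
record = ('{"name": "mission", "pandas": {"attributes": "[0.4, 0.5]", '
          '"pt": "giant", "id": "1", "zip": "94110", "happy": "True"}}')

doc = json.loads(record)
inner = doc["pandas"]

# Map each disputed field to the Python type the JSON decoder produced.
observed = {
    "pandas": type(inner).__name__,
    "pandas.happy": type(inner["happy"]).__name__,
    "pandas.attributes": type(inner["attributes"]).__name__,
}

for field in sorted(observed):
    print(field, "->", observed[field])
```

A JSON object decodes to a dict (a struct in Spark terms), while both happy and attributes decode to str because they are quoted in the source data.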

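A schema is used only to avoid type inference, not for type casting or any other transformation. If you want to transform the data, you have to parse it first: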

def parse(s: str) -> Row:
    return ...

rdd.map(parse).toDF(schema)
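
One way the parse function might be filled in is sketched below. It is pure Python and assumes that every record has the same keys as the sample in the question, that happy is always the text "True" or "False", and that the nested struct fields are declared in the order attributes, happy, id, pt, zip. It returns nested tuples rather than Row objects, on the assumption that tuples are matched to struct fields by position.

```python
import json

def parse(s):
    """Convert one raw JSON string into a nested tuple with proper types."""
    doc = json.loads(s)
    p = doc["pandas"]
    # "attributes" is a JSON array encoded as a string, so decode it again.
    attributes = [float(x) for x in json.loads(p["attributes"])]
    # "happy" holds the text "True" or "False"; bool("False") would be True,
    # so compare against the literal instead.
    happy = p["happy"] == "True"
    # Assumed field order: name, then (attributes, happy, id, pt, zip).
    return (doc["name"], (attributes, happy, p["id"], p["pt"], p["zip"]))

raw = ('{"name": "mission", "pandas": {"attributes": "[0.4, 0.5]", '
       '"pt": "giant", "id": "1", "zip": "94110", "happy": "True"}}')
row = parse(raw)
print(row)
```

With a schema whose fields follow that same order, the function can be plugged into the call above, for example json2.map(parse).toDF(schema). Malformed records are not handled here and would raise an exception inside the Spark task.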

Assuming you have JSON like this (with the types fixed):

{"name": "mission", "pandas": {"attributes": [0.4, 0.5], "pt": "giant", "id": "1", "zip": "94110", "happy": true}}

the correct schema would look like this:

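from pyspark.sql.types import (
    ArrayType, BooleanType, DoubleType, StringType, StructField, StructType)

# "pandas" is a nested struct, so its type is a StructType, not an ArrayType.
schema = StructType([
    StructField("name", StringType(), True),
    StructField("pandas", StructType([
        StructField("attributes", ArrayType(DoubleType(), True), True),
        StructField("happy", BooleanType(), True),
        StructField("id", StringType(), True),
        StructField("pt", StringType(), True),
        StructField("zip", StringType(), True)]),
    True)])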


Answer 1: score 2, accepted, 2016-05-23 20:14:18
