
SPARK processing of polymorphic JSON

Consider this JSON input (shown on multiple lines for readability, but the actual input documents are single-line and CR-delimited):

{
  "common": { "type":"A", "date":"2020-01-01T12:00:00" },
  "data": {
    "name":"Dave",
    "pets": [ "dog", "cat" ]
  }
}
{
  "common": { "type": "B", "date":"2020-01-01T12:00:00" },
  "data": {
    "whatever": { "X": {"foo":3}, "Y":"bar" },
    "favoriteInts": [ 0, 1, 7]
  }
}

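I am familiar with json-schema and how it lets me describe that the data substructure can be either name, pets or whatever, favoriteInts. We use the common.type field to identify the type at runtime.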

Is this possible in a SPARK schema definition? An initial experiment along these lines:

    schema = StructType([
        StructField("common", StructType(common_schema)), # .. because the type is consistent                                       
        StructField("data", StructType())  # attempting to declare a "generic" struct
    ])
    df = spark.read.option("multiline", "true").json(source, schema)

does not work. When the data struct in the input actually contains fields (two of them in this particular example), reading it gives:

+--------------------+----+                                                     
|              common|data|
+--------------------+----+
|{2020-01-01T12:00...|  {}|
+--------------------+----+

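and trying to extract any named field yields No such struct field <whatever>. Leaving the "generic struct" out of the schema definition altogether yields a dataframe with no field named data at all, let alone the fields inside it.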

Beyond this, what I am ultimately trying to do is something like this:

df = spark.read.json(source)

def processA(frame):
    frame.select( frame.data.name )  # we KNOW name exists for type A
    ...

def processB(frame):
    frame.select( frame.data.favoriteInts )  # we KNOW favoriteInts exists for type B
    ...

processA( df.filter(df.common.type == "A") )
processB( df.filter(df.common.type == "B") )

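You can accommodate the uncertainty by declaring every possible field in the struct as a nested, nullable type (nullable is the True argument). Fields that a given record does not contain are simply read as null.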

from pyspark.sql.types import StructType, StringType, ArrayType, StructField, IntegerType

data_schema = StructType([
    # Type A related attributes
    StructField("name",StringType(),True), # True implies nullable
    StructField("pets",ArrayType(StringType()),True),

    # Type B related attributes
    StructField("whatever",StructType([
        StructField("X",StructType([
            StructField("foo",IntegerType(),True)
        ]),True),
        StructField("Y",StringType(),True)
    ]),True), # True implies nullable
    StructField("favoriteInts",ArrayType(IntegerType()),True),
])
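# common_schema is not shown in the question; assumed here to be two nullable string fields
common_schema = [
    StructField("type", StringType(), True),
    StructField("date", StringType(), True),
]
schema = StructType([
    StructField("common", StructType(common_schema)),  # the common part is the same for every type
    StructField("data", data_schema),  # union of the type A and type B fields
])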
