Parsing multiline nested json in Spark 3 dataframe using pyspark

I am having trouble reading multiline JSON in PySpark. Example:

{
    "_index": "kl.service-log.2021.04.06",
    "_type": "_doc",
    "_id": "hZ3SpHgBhp2ht1Q8n8ym",
    "_version": 1,
    "_score": null,
    "_source": {
        "publishTime": "2021-04-06T01:36:09.422Z",
        "client_ips": "2601:247:c580:3337:45c0:dd63:35e0:9247",
        "body": {
            "events": "[{\"key\":\"Key  Launched\",\"count\":1,\"timestamp\":1617672914673,\"sum\":0},{\"key\":\"Viewed Screen\",\"count\":1,\"timestamp\":1617672969301,\"sum\":0}]",
            "sdk_name": "java-native-android",
            "tz": "-300"
        }
    }
}

Schema is defined below:

root
 |-- _id: string (nullable = true)
 |-- _index: string (nullable = true)
 |-- _score: string (nullable = true)
 |-- _source: struct (nullable = true)
 |    |-- body: struct (nullable = true)
 |    |    |-- events: string (nullable = true)
 |    |    |-- sdk_name: string (nullable = true)
 |    |    |-- tz: string (nullable = true)
 |    |-- client_ips: string (nullable = true)
 |    |-- publishTime: string (nullable = true)
 |-- _type: string (nullable = true)
 |-- _version: long (nullable = true)

Under _source.body.events, the datatype is string, but the value is actually a JSON array containing 2 records. I would like to have them as 2 separate rows with specific columns.
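
The reason the field is inferred as a string is that its value is a JSON document serialized into a string, so it needs a second decoding pass. A minimal plain-Python sketch (standard library only, record trimmed to the relevant fields, variable names chosen for illustration) shows the double encoding:

```python
import json

# Trimmed copy of the record from the question; `events` holds JSON text, not a JSON array.
raw = r'''
{
    "_id": "hZ3SpHgBhp2ht1Q8n8ym",
    "_source": {
        "body": {
            "events": "[{\"key\":\"Key  Launched\",\"count\":1,\"timestamp\":1617672914673,\"sum\":0},{\"key\":\"Viewed Screen\",\"count\":1,\"timestamp\":1617672969301,\"sum\":0}]",
            "sdk_name": "java-native-android",
            "tz": "-300"
        }
    }
}
'''

record = json.loads(raw)                           # first pass: the outer document
events_text = record["_source"]["body"]["events"]  # still a str at this point
events = json.loads(events_text)                   # second pass: the embedded array

# One flat row per event, carrying the parent _id along.
rows = [{"_id": record["_id"], **event} for event in events]
for row in rows:
    print(row)
```

In Spark, from_json plays the role of the second json.loads call.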

You can parse the events column using from_json and reconstruct the _source column:

import pyspark.sql.functions as F

df2 = df.withColumn(
    '_source', 
    F.struct(
        F.struct(
            F.from_json(
                '_source.body.events',
                'array<struct<key:string, count:int, timestamp:long, sum:int>>'
            ).alias('events'), 
            '_source.body.sdk_name', 
            '_source.body.tz'
        ).alias('body'), 
        '_source.client_ips', 
        '_source.publishTime'
    )
)

df2.show(truncate=False)
+--------------------+-------------------------+------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----+--------+
|_id                 |_index                   |_score|_source                                                                                                                                                                      |_type|_version|
+--------------------+-------------------------+------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----+--------+
|hZ3SpHgBhp2ht1Q8n8ym|kl.service-log.2021.04.06|null  |[[[[Key  Launched, 1, 1617672914673, 0], [Viewed Screen, 1, 1617672969301, 0]], java-native-android, -300], 2601:247:c580:3337:45c0:dd63:35e0:9247, 2021-04-06T01:36:09.422Z]|_doc |1       |
+--------------------+-------------------------+------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----+--------+

df2.printSchema()
root
 |-- _id: string (nullable = true)
 |-- _index: string (nullable = true)
 |-- _score: string (nullable = true)
 |-- _source: struct (nullable = false)
 |    |-- body: struct (nullable = false)
 |    |    |-- events: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- key: string (nullable = true)
 |    |    |    |    |-- count: integer (nullable = true)
 |    |    |    |    |-- timestamp: long (nullable = true)
 |    |    |    |    |-- sum: integer (nullable = true)
 |    |    |-- sdk_name: string (nullable = true)
 |    |    |-- tz: string (nullable = true)
 |    |-- client_ips: string (nullable = true)
 |    |-- publishTime: string (nullable = true)
 |-- _type: string (nullable = true)
 |-- _version: long (nullable = true)

If you want to explode the array to separate rows, you can operate on df2 obtained above:

df3 = df2.withColumn(
    'idx', 
    F.expr('explode(sequence(0, size(_source.body.events) - 1))')
).withColumn(
    '_source', 
    F.struct(
        F.struct(
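            # Alias the element so the field keeps the name 'events' instead of an auto-generated one
            F.expr('_source.body.events[idx]').alias('events'),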
            '_source.body.sdk_name', 
            '_source.body.tz'
        ).alias('body'), 
        '_source.client_ips', 
        '_source.publishTime'
    )
).drop('idx')

df3.show(truncate=False)
+--------------------+-------------------------+------+-------------------------------------------------------------------------------------------------------------------------------------+-----+--------+
|_id                 |_index                   |_score|_source                                                                                                                              |_type|_version|
+--------------------+-------------------------+------+-------------------------------------------------------------------------------------------------------------------------------------+-----+--------+
|hZ3SpHgBhp2ht1Q8n8ym|kl.service-log.2021.04.06|null  |[[[Key  Launched, 1, 1617672914673, 0], java-native-android, -300], 2601:247:c580:3337:45c0:dd63:35e0:9247, 2021-04-06T01:36:09.422Z]|_doc |1       |
|hZ3SpHgBhp2ht1Q8n8ym|kl.service-log.2021.04.06|null  |[[[Viewed Screen, 1, 1617672969301, 0], java-native-android, -300], 2601:247:c580:3337:45c0:dd63:35e0:9247, 2021-04-06T01:36:09.422Z]|_doc |1       |
+--------------------+-------------------------+------+-------------------------------------------------------------------------------------------------------------------------------------+-----+--------+
