
Spark - get column names as a list from nested data

For the dataframe below, which was generated from an Avro file, I'm trying to get the column names as a list or in some other format so that I can use them in a select statement. node1 and node2 have the same elements. For example, I understand that we could do df.select(col('data.node1.name')), but I'm not sure 1) how to select all columns at once without hardcoding all the column names, and 2) how to handle the nested part. I think that, to make it readable, the productvalues and porders should be selected into separate individual dataframes/tables? Many thanks for your help.

Input schema:

root
 |-- metadata: struct
 |   ...
 |-- data: struct
 |    |-- node1: struct
 |    |    |-- name: string
 |    |    |-- productlist: array
 |    |    |    |-- element: struct
 |    |    |    |    |-- productvalues: array
 |    |    |    |    |    |-- element: struct
 |    |    |    |    |    |    |-- pname: string
 |    |    |    |    |    |    |-- porders: array
 |    |    |    |    |    |    |    |-- element: struct
 |    |    |    |    |    |    |    |    |-- ordernum: int
 |    |    |    |    |    |    |    |    |-- field: string
 |    |-- node2: struct
 |    |    |-- name: string
 |    |    |-- productlist: array
 |    |    |    |-- element: struct
 |    |    |    |    |-- productvalues: array
 |    |    |    |    |    |-- element: struct
 |    |    |    |    |    |    |-- pname: string
 |    |    |    |    |    |    |-- porders: array
 |    |    |    |    |    |    |    |-- element: struct
 |    |    |    |    |    |    |    |    |-- ordernum: int
 |    |    |    |    |    |    |    |    |-- field: string

Instead of gathering all the data in one table, I would recommend creating a separate table for each list. To get values out of an array you can use the explode function.

For example, to build the productlist table:

from pyspark.sql.functions import col, explode

productlist = df.select(col('data.node1.name').alias("name"), explode(col('data.node1.productlist')).alias("first_explode"))

In the next step you can work from the productlist df:

productValue = productlist.select(col('name'), explode(col('first_explode.productvalues')).alias("second_explode"))

and so on; a sketch of the final step follows.
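To make the pattern concrete, here is a sketch of what that final step could look like, building on the snippets above (it assumes the same imports; the alias third_explode is illustrative, not from the original answer):

# Hypothetical final step in the same chain: explode porders, then flatten
# the struct fields into plain columns.
porders = productValue.select(
    col('name'),
    col('second_explode.pname').alias('pname'),
    explode(col('second_explode.porders')).alias('third_explode')
).select('name', 'pname', 'third_explode.ordernum', 'third_explode.field')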

You can also get some help from this link: https://sparkbyexamples.com/pyspark/pyspark-explode-array-and-map-columns-to-rows/

So, this is not a perfect answer, but I hope it gives you some ideas for solving your problem. I know you said you don't want to hardcode your column names, but I'm unable to handle that part at the moment.

First things first, I created this sample JSON for testing:

{
    "metadata": {},
    "data": {
        "node1": {
            "name": "Node001",
            "productlist": [
                {
                    "productvalues": [
                        {
                            "pname": "Node001-P001",
                            "porders": [
                                {"ordernum":  1, "field": "Node001-P001-001"},
                                {"ordernum":  2, "field": "Node001-P001-002"}
                            ]
                        },
                        {
                            "pname": "Node001-P002",
                            "porders": [
                                {"ordernum":  3, "field": "Node001-P002-003"},
                                {"ordernum":  4, "field": "Node001-P002-004"},
                                {"ordernum":  5, "field": "Node001-P002-005"},
                                {"ordernum":  6, "field": "Node001-P002-006"}
                            ]
                        },
                        {
                            "pname": "Node001-P003",
                            "porders": [
                                {"ordernum":  7, "field": "Node001-P003-007"}
                            ]
                        }
                    ]
                },
                {
                    "productvalues": [
                        {
                            "pname": "Node001-P004",
                            "porders": [
                                {"ordernum":  8, "field": "Node001-P004-008"},
                                {"ordernum":  9, "field": "Node001-P004-009"},
                                {"ordernum": 10, "field": "Node001-P004-010"}
                            ]
                        },
                        {
                            "pname": "Node001-P005",
                            "porders": [
                                
                                {"ordernum": 11, "field": "Node001-P005-011"},
                                {"ordernum": 12, "field": "Node001-P005-012"},
                                {"ordernum": 13, "field": "Node001-P005-013"}
                            ]
                        }
                    ]
                }
            ]
        },
        "node2": {
            "name": "Node002",
            "productlist": [
                {
                    "productvalues": [
                        {
                            "pname": "Node002-P001",
                            "porders": [
                                {"ordernum": 14, "field": "Node002-P001-014"}
                            ]
                        },
                        {
                            "pname": "Node002-P002",
                            "porders": [
                                {"ordernum": 15, "field": "Node002-P002-015"}
                            ]
                        },
                        {
                            "pname": "Node002-P003",
                            "porders": [
                                {"ordernum": 16, "field": "Node002-P003-016"}
                            ]
                        }
                    ]
                },
                {
                    "productvalues": [
                        {
                            "pname": "Node002-P004",
                            "porders": [
                                {"ordernum": 17, "field": "Node002-P004-017"}
                            ]
                        },
                        {
                            "pname": "Node002-P005",
                            "porders": [
                                
                                {"ordernum": 18, "field": "Node002-P005-018"}
                            ]
                        }
                    ]
                }
            ]
        }
    }
}
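For reference, a minimal sketch of how this sample could be loaded, assuming it is saved at the hypothetical path /tmp/sample.json; multiLine is needed because each record spans several lines:

from pyspark.sql import functions as F

# Spark infers the nested schema from the multi-line JSON file
df = spark.read.json('/tmp/sample.json', multiLine=True)
df.printSchema()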

Now, this is a "dict-like" list that we will need later: each entry describes one nesting level, with 'col' listing the columns to keep and 'exp' listing the array column to explode.

cols_dict = [
    {
        'col': ['data.node1.name'],
        'exp': ['data.node1.productlist'],
    },
    {
        'exp': ['productlist.productvalues'],
    },
    {
        'col': ['productvalues.pname'],
        'exp': ['productvalues.porders'],
    },
    {
        'col': ['porders.ordernum', 'porders.field']
    }
]

And finally, loop through this list and apply the transformations to get your final result:

from pyspark.sql import functions as F

dfx = df
select_col = []
for i in cols_dict:
    # keep only the leaf name of each column selected in the previous step
    select_col = [c.split('.')[-1] for c in select_col]
    if i.get('col'):
        select_col += i['col']

    select_exp = []
    if i.get('exp'):
        select_exp += i['exp']

    dfx = dfx.select([F.col(c) for c in select_col] + [F.explode(c).alias(c.split('.')[-1]) for c in select_exp])

Running dfx.show() gives:

+-------+------------+--------+----------------+
|   name|       pname|ordernum|           field|
+-------+------------+--------+----------------+
|Node001|Node001-P001|       1|Node001-P001-001|
|Node001|Node001-P001|       2|Node001-P001-002|
|Node001|Node001-P002|       3|Node001-P002-003|
|Node001|Node001-P002|       4|Node001-P002-004|
|Node001|Node001-P002|       5|Node001-P002-005|
|Node001|Node001-P002|       6|Node001-P002-006|
|Node001|Node001-P003|       7|Node001-P003-007|
|Node001|Node001-P004|       8|Node001-P004-008|
|Node001|Node001-P004|       9|Node001-P004-009|
|Node001|Node001-P004|      10|Node001-P004-010|
|Node001|Node001-P005|      11|Node001-P005-011|
|Node001|Node001-P005|      12|Node001-P005-012|
|Node001|Node001-P005|      13|Node001-P005-013|
+-------+------------+--------+----------------+
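As a possible extension (my own sketch, not part of the original answer): the first entry of cols_dict is the only place where node1 is hardcoded, so you could rebuild that entry per node and union the results:

from functools import reduce

# Hypothetical generalization: discover the node names from the schema,
# run the same explode loop once per node, then union the per-node results.
results = []
for node in df.select('data.*').columns:
    cols_dict[0] = {
        'col': [f'data.{node}.name'],
        'exp': [f'data.{node}.productlist'],
    }
    dfx = df
    select_col = []
    for i in cols_dict:
        select_col = [c.split('.')[-1] for c in select_col]
        if i.get('col'):
            select_col += i['col']
        select_exp = i.get('exp', [])
        dfx = dfx.select([F.col(c) for c in select_col] +
                         [F.explode(c).alias(c.split('.')[-1]) for c in select_exp])
    results.append(dfx)

all_nodes = reduce(lambda a, b: a.unionByName(b), results)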

With the following approach, you will not need to hardcode all the struct fields. But you will need to provide a list of the columns whose type is array of struct. You have 3 such columns; we will create one additional one, so 4 in total.
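If you would rather not write even that list by hand, here is a hedged sketch of a helper (my own addition, not part of the original answer) that walks the schema and collects every array-of-struct field; you would then pass the deduplicated leaf names (path.split('.')[-1]) to the loop below:

from pyspark.sql.types import ArrayType, StructType

# Hypothetical helper: recursively collect the dotted paths of every field
# whose type is array<struct<...>>, descending into structs and array elements.
def array_of_struct_fields(schema, prefix=''):
    names = []
    for f in schema.fields:
        path = prefix + f.name
        dt = f.dataType
        if isinstance(dt, ArrayType) and isinstance(dt.elementType, StructType):
            names.append(path)
            names += array_of_struct_fields(dt.elementType, path + '.')
        elif isinstance(dt, StructType):
            names += array_of_struct_fields(dt, path + '.')
    return names

# array_of_struct_fields(df.schema) ->
# ['data.node1.productlist', 'data.node1.productlist.productvalues', ...]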

First of all, the dataset, similar to yours:

from pyspark.sql import functions as F

df = spark.createDataFrame(
    [(
        ('a', 'b'),
        (
            (
                'name_1',
                [
                    ([
                        (
                            'pname_111',
                            [
                                (1111, 'field_1111'),
                                (1112, 'field_1112')
                            ]
                        ),
                        (
                            'pname_112',
                            [
                                (1121, 'field_1121'),
                                (1122, 'field_1122')
                            ]
                        )
                    ],),
                    ([
                        (
                            'pname_121',
                            [
                                (1211, 'field_1211'),
                                (1212, 'field_1212')
                            ]
                        ),
                        (
                            'pname_122',
                            [
                                (1221, 'field_1221'),
                                (1222, 'field_1222')
                            ]
                        )
                    ],)
                ]
            ),
            (
                'name_2',
                [
                    ([
                        (
                            'pname_211',
                            [
                                (2111, 'field_2111'),
                                (2112, 'field_2112')
                            ]
                        ),
                        (
                            'pname_212',
                            [
                                (2121, 'field_2121'),
                                (2122, 'field_2122')
                            ]
                        )
                    ],),
                    ([
                        (
                            'pname_221',
                            [
                                (2211, 'field_2211'),
                                (2212, 'field_2212')
                            ]
                        ),
                        (
                            'pname_222',
                            [
                                (2221, 'field_2221'),
                                (2222, 'field_2222')
                            ]
                        )
                    ],)
                ]
            )
        ),
    )],
    'metadata:struct<fld1:string,fld2:string>, data:struct<node1:struct<name:string, productlist:array<struct<productvalues:array<struct<pname:string, porders:array<struct<ordernum:int, field:string>>>>>>>, node2:struct<name:string, productlist:array<struct<productvalues:array<struct<pname:string, porders:array<struct<ordernum:int, field:string>>>>>>>>'
)
# df.printSchema()
# root
#  |-- metadata: struct (nullable = true)
#  |    |-- fld1: string (nullable = true)
#  |    |-- fld2: string (nullable = true)
#  |-- data: struct (nullable = true)
#  |    |-- node1: struct (nullable = true)
#  |    |    |-- name: string (nullable = true)
#  |    |    |-- productlist: array (nullable = true)
#  |    |    |    |-- element: struct (containsNull = true)
#  |    |    |    |    |-- productvalues: array (nullable = true)
#  |    |    |    |    |    |-- element: struct (containsNull = true)
#  |    |    |    |    |    |    |-- pname: string (nullable = true)
#  |    |    |    |    |    |    |-- porders: array (nullable = true)
#  |    |    |    |    |    |    |    |-- element: struct (containsNull = true)
#  |    |    |    |    |    |    |    |    |-- ordernum: integer (nullable = true)
#  |    |    |    |    |    |    |    |    |-- field: string (nullable = true)
#  |    |-- node2: struct (nullable = true)
#  |    |    |-- name: string (nullable = true)
#  |    |    |-- productlist: array (nullable = true)
#  |    |    |    |-- element: struct (containsNull = true)
#  |    |    |    |    |-- productvalues: array (nullable = true)
#  |    |    |    |    |    |-- element: struct (containsNull = true)
#  |    |    |    |    |    |    |-- pname: string (nullable = true)
#  |    |    |    |    |    |    |-- porders: array (nullable = true)
#  |    |    |    |    |    |    |    |-- element: struct (containsNull = true)
#  |    |    |    |    |    |    |    |    |-- ordernum: integer (nullable = true)
#  |    |    |    |    |    |    |    |    |-- field: string (nullable = true)

The answer

nodes = df.select("data.*").columns
for n in nodes:
    # add the node's title as an extra field inside each node struct (withField requires Spark 3.1+)
    df = df.withColumn("data", F.col("data").withField(n, F.struct(F.lit(n).alias("node"), f"data.{n}.*")))
# convert "data" from a struct of nodes into an array of node structs
df = df.withColumn("data", F.array("data.*"))

for arr_of_struct in ["data", "productlist", "productvalues", "porders"]:
    # inline() turns an array of structs into one row per element,
    # with the struct's fields as top-level columns
    df = df.select(
        *[c for c in df.columns if c != arr_of_struct],
        F.expr(f"inline({arr_of_struct})")
    )

Results:

df.printSchema()
# root
#  |-- metadata: struct (nullable = true)
#  |    |-- fld1: string (nullable = true)
#  |    |-- fld2: string (nullable = true)
#  |-- node: string (nullable = false)
#  |-- name: string (nullable = true)
#  |-- pname: string (nullable = true)
#  |-- ordernum: integer (nullable = true)
#  |-- field: string (nullable = true)

df.show()
# +--------+-----+------+---------+--------+----------+
# |metadata| node|  name|    pname|ordernum|     field|
# +--------+-----+------+---------+--------+----------+
# |  {a, b}|node1|name_1|pname_111|    1111|field_1111|
# |  {a, b}|node1|name_1|pname_111|    1112|field_1112|
# |  {a, b}|node1|name_1|pname_112|    1121|field_1121|
# |  {a, b}|node1|name_1|pname_112|    1122|field_1122|
# |  {a, b}|node1|name_1|pname_121|    1211|field_1211|
# |  {a, b}|node1|name_1|pname_121|    1212|field_1212|
# |  {a, b}|node1|name_1|pname_122|    1221|field_1221|
# |  {a, b}|node1|name_1|pname_122|    1222|field_1222|
# |  {a, b}|node2|name_2|pname_211|    2111|field_2111|
# |  {a, b}|node2|name_2|pname_211|    2112|field_2112|
# |  {a, b}|node2|name_2|pname_212|    2121|field_2121|
# |  {a, b}|node2|name_2|pname_212|    2122|field_2122|
# |  {a, b}|node2|name_2|pname_221|    2211|field_2211|
# |  {a, b}|node2|name_2|pname_221|    2212|field_2212|
# |  {a, b}|node2|name_2|pname_222|    2221|field_2221|
# |  {a, b}|node2|name_2|pname_222|    2222|field_2222|
# +--------+-----+------+---------+--------+----------+

Explanation

nodes = df.select("data.*").columns
for n in nodes:
    df = df.withColumn("data", F.col("data").withField(n, F.struct(F.lit(n).alias("node"), f"data.{n}.*")))
df = df.withColumn("data", F.array("data.*"))

Using the above, I decided to preserve the node title in case you need it. It first gets the list of node names from the "data" column's fields. The for loop then adds one extra field inside every node struct holding that node's title. Finally, we convert the "data" column from a struct into an array so that in the next step we can easily explode it into columns.
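As a standalone illustration of withField (my own minimal example; withField is available from Spark 3.1), it returns a copy of a struct column with one field added or replaced:

# Minimal, hypothetical demo of Column.withField
demo = spark.createDataFrame([((1, 2),)], 's:struct<a:int, b:int>')
demo.withColumn('s', F.col('s').withField('label', F.lit('x'))).show()
# +---------+
# |        s|
# +---------+
# |{1, 2, x}|
# +---------+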

for arr_of_struct in ["data", "productlist", "productvalues", "porders"]:
    df = df.select(
        *[c for c in df.columns if c != arr_of_struct],
        F.expr(f"inline({arr_of_struct})")
    )

In the above, the key line is F.expr(f"inline({arr_of_struct})"). It must be used inside a loop because inline is a generator function, and generators cannot be nested inside one another in a single Spark select. inline explodes an array of structs into one row per element, with the struct's fields as columns. At this step you have 4 array-of-struct columns, so 4 inline expressions are applied, one per iteration.
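For a standalone illustration of inline (my own minimal example, not from the original answer):

# Minimal, hypothetical demo of the inline() generator
demo = spark.createDataFrame(
    [([(1, 'a'), (2, 'b')],)],
    'arr:array<struct<num:int, letter:string>>'
)
demo.select(F.expr('inline(arr)')).show()
# +---+------+
# |num|letter|
# +---+------+
# |  1|     a|
# |  2|     b|
# +---+------+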
