
How to iterate through an array of structs and return the element I want in PySpark

Here is my example JSON file:

{"data":"example1","data2":"example2","register":[{"name":"John","last_name":"Travolta","age":68},{"name":"Nicolas","last_name":"Cage","age":58}], "data3":"example3","data4":"example4"}

And I have a data schema similar to this (totally illustrative):

root
 |-- register: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- age: long (nullable = true)
 |    |    |-- last_name: string (nullable = true)
 |    |    |-- name: string (nullable = true)
 |-- data: string (nullable = true)
 |-- data2: string (nullable = true)
 |-- data3: string (nullable = true)
 |-- data4: string (nullable = true)

What I want is to iterate over this register array, check whether the name fields match a given person (e.g. John Travolta), and create a new struct (new_register, for example) containing all the fields of the array element at the same index as the matching name.

I tried some of Spark's built-in functions, such as filter, when, and contains, but none of them gave me the desired result.

I also tried to implement a UDF, but I couldn't find a way to apply the function to the field I want.

How do I resolve the above problem?

First explode the array field, then access the struct fields with dot notation and filter on the required values. Here is the code.

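from pyspark.sql.functions import col, explode

df.printSchema()
df.show(10, False)

# Explode the array into one row per struct, then keep only the matching row
df1 = df.withColumn("new_struct", explode("register")).filter(
    (col("new_struct.last_name") == "Travolta") & (col("new_struct.name") == "John")
)
df1.show(10, False)
df1.printSchema()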


root
 |-- data: string (nullable = true)
 |-- data2: string (nullable = true)
 |-- data3: string (nullable = true)
 |-- data4: string (nullable = true)
 |-- register: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- age: long (nullable = true)
 |    |    |-- last_name: string (nullable = true)
 |    |    |-- name: string (nullable = true)

+--------+--------+--------+--------+-------------------------------------------+
|data    |data2   |data3   |data4   |register                                   |
+--------+--------+--------+--------+-------------------------------------------+
|example1|example2|example3|example4|[{68, Travolta, John}, {58, Cage, Nicolas}]|
+--------+--------+--------+--------+-------------------------------------------+

+--------+--------+--------+--------+-------------------------------------------+--------------------+
|data    |data2   |data3   |data4   |register                                   |new_struct          |
+--------+--------+--------+--------+-------------------------------------------+--------------------+
|example1|example2|example3|example4|[{68, Travolta, John}, {58, Cage, Nicolas}]|{68, Travolta, John}|
+--------+--------+--------+--------+-------------------------------------------+--------------------+

root
 |-- data: string (nullable = true)
 |-- data2: string (nullable = true)
 |-- data3: string (nullable = true)
 |-- data4: string (nullable = true)
 |-- register: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- age: long (nullable = true)
 |    |    |-- last_name: string (nullable = true)
 |    |    |-- name: string (nullable = true)
 |-- new_struct: struct (nullable = true)
 |    |-- age: long (nullable = true)
 |    |-- last_name: string (nullable = true)
 |    |-- name: string (nullable = true)


 