
PySpark: how to sort an array of structs with 2 elements

I have the following schema

|-- node: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- score: string (nullable = true)
 |    |    |-- toID: string (nullable = true)

+---------------------+
|node                 |
+---------------------+
|[[57753, XV0912938...|
|[[51959, 02384848... |
|[[63487, 898898989...|
+---------------------+

I want to sort by score and select the top 10 toID values. What is an easy PySpark way to do this?

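The sort_array(<array column>, asc=False) function sorts the elements within an array in descending order. If the elements are structs, they are compared field by field, starting with the first field in the struct. Since the score field is the first field in the struct, you can sort the array of structs directly with this function. Note that score is a string in the question's schema, so it would be compared lexicographically; if a numeric ranking is intended, cast score to a numeric type before sorting.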
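To make the ordering concrete, here is a minimal plain-Python sketch. It is not Spark code. It only models the comparison, using the fact that Python tuples, like structs, are compared field by field. The (score, toID) pairs and the names nodes and top_n are made up for illustration.

```python
# Plain-Python model of the ordering only; this is not Spark code.
# Hypothetical (score, toID) pairs with scores stored as strings,
# mirroring the string-typed score field in the question's schema.
nodes = [("9", "a"), ("57753", "b"), ("100", "c")]

# Tuples compare field by field, so the first field (score) decides the order.
# With string scores the comparison is lexicographic: "9" > "57753" > "100".
by_string = sorted(nodes, reverse=True)

# Converting score to int first gives the numeric order: 57753 > 100 > 9.
by_number = sorted(((int(score), to_id) for score, to_id in nodes), reverse=True)

top_n = 2  # the question asks for 10; 2 keeps the illustration short
top_ids_string = [to_id for _, to_id in by_string[:top_n]]
top_ids_number = [to_id for _, to_id in by_number[:top_n]]

print(top_ids_string)  # ['a', 'b']
print(top_ids_number)  # ['b', 'c']
```

The two results differ only because of the field type, which is why a string score is worth casting before relying on the sort.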

Here's an example to get the top 5 IDs based on the score within the struct.

Suppose your data looks like the following

+----------------------------------------------------------------------------------------------------------------------------------------+
|arr_of_structs                                                                                                                          |
+----------------------------------------------------------------------------------------------------------------------------------------+
|[{270, foo0}, {237, foo1}, {2440, foo2}, {2059, foo3}, {3960, foo4}, {3009, foo5}, {842, foo6}, {1043, foo7}, {852, foo8}, {3925, foo9}]|
|[{1028, bar0}, {2571, bar1}, {1682, bar2}, {2543, bar3}, {689, bar4}, {11, bar5}, {1, bar6}, {740, bar7}, {113, bar8}, {701, bar9}]     |
+----------------------------------------------------------------------------------------------------------------------------------------+

root
 |-- arr_of_structs: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _1: long (nullable = true)
 |    |    |-- _2: string (nullable = true)

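The code is simple:

  • use sort_array to sort the array of structs in descending order
  • use the slice function to retain the first N elements of the sorted array
  • use transform to keep only the ID field of each retained struct

from pyspark.sql import functions as func

# data_sdf is the DataFrame shown above
# sort descending by the first struct field, keep the first 5 elements, then extract the IDs
data_sdf. \
    withColumn('arr_of_structs_top5', func.slice(func.sort_array('arr_of_structs', asc=False), 1, 5)). \
    withColumn('top5_id_only', func.expr('transform(arr_of_structs_top5, x -> x._2)')). \
    drop('arr_of_structs'). \
    show(truncate=False)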

# +----------------------------------------------------------------------+------------------------------+
# |arr_of_structs_top5                                                   |top5_id_only                  |
# +----------------------------------------------------------------------+------------------------------+
# |[{3960, foo4}, {3925, foo9}, {3009, foo5}, {2440, foo2}, {2059, foo3}]|[foo4, foo9, foo5, foo2, foo3]|
# |[{2571, bar1}, {2543, bar3}, {1682, bar2}, {1028, bar0}, {740, bar7}] |[bar1, bar3, bar2, bar0, bar7]|
# +----------------------------------------------------------------------+------------------------------+
