PySpark: Get top k column for each row in dataframe

I have a dataframe with a score for each offer for each contact. I want to create a new dataframe from it that holds the top 3 offers for each contact.

The input dataframe is something like this:

| contact | offer 1 | offer 2 | offer 3 | offer 4 | offer 5 | offer 6 |
| name 1  | 0       | 3       | 1       |   2     |    1    |    6    |
| name 2  | 1       | 7       | 2       |   9     |    5    |    3    |

I want to convert it to dataframe like this:

| contact | best offer | second best offer | third best offer |
| name 1  | offer 6    | offer 2           | offer 4          |
| name 2  | offer 4    | offer 2           | offer 5          |

You'll need a few imports:

from pyspark.sql.functions import array, col, lit, sort_array, struct

With data as shown in the question:

# sc is the SparkContext that the PySpark shell provides
df = sc.parallelize([
    ("name 1", 0, 3, 1, 2, 1, 6),
    ("name 2", 1, 7, 2, 9, 5, 3),
]).toDF(["contact"] + ["offer_{}".format(i) for i in range(1, 7)])

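you can assemble an array of structs, one per offer column, and sort it in descending order. Each struct holds the score (v) first and the column name (k) second, so the sort compares scores before names: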

offers = sort_array(array(*[
    struct(col(c).alias("v"), lit(c).alias("k")) for c in df.columns[1:]
]), asc=False)

and select the name field (k) of the first three elements:

df.select(
    ["contact"] + [offers[i]["k"].alias("_{}".format(i)) for i in [0, 1, 2]])

which should give the following result:

|contact|     _0|     _1|     _2|
| name 1|offer_6|offer_2|offer_4|
| name 2|offer_4|offer_2|offer_5|
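
The ordering works because structs are compared field by field, first by v and then by k, much like Python compares tuples. As a rough illustration only, here is the same per-row logic in plain Python rather than Spark; the helper name top_k_offers and the dict layout are made up for this sketch:

```python
def top_k_offers(scores, k=3):
    """Return the names of the k highest-scoring offers.

    scores: dict mapping offer name -> score (hypothetical layout for this sketch).
    """
    # Put the score first so tuples sort by score, then by name,
    # mirroring struct(v, k) in the Spark version.
    pairs = [(score, name) for name, score in scores.items()]
    pairs.sort(reverse=True)  # descending, like sort_array(..., asc=False)
    return [name for _, name in pairs[:k]]


rows = {
    "name 1": {"offer_1": 0, "offer_2": 3, "offer_3": 1,
               "offer_4": 2, "offer_5": 1, "offer_6": 6},
    "name 2": {"offer_1": 1, "offer_2": 7, "offer_3": 2,
               "offer_4": 9, "offer_5": 5, "offer_6": 3},
}

result = {contact: top_k_offers(scores) for contact, scores in rows.items()}
print(result)
```

Note the tie in the first row: offer_3 and offer_5 both score 1. In a descending sort the tie falls back to the name, also in descending order, so offer_5 comes before offer_3. That does not affect the top 3 here, but it can matter when equal scores sit at the cut-off.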

Rename the columns according to your needs and you're ready to go.
