Pyspark : Return all column names of max values

Question

I have a DataFrame like this:

from pyspark.sql import functions as f
from pyspark.sql.types import StringType

data = [("ID1", 3, 5, 5), ("ID2", 4, 5, 6), ("ID3", 3, 3, 3)]
df = sqlContext.createDataFrame(data, ["ID", "colA", "colB", "colC"])
df.show()

cols = df.columns
# Name of the first column holding the row maximum (+1 skips the "ID" column)
maxcol = f.udf(lambda row: cols[row.index(max(row)) + 1], StringType())

maxDF = df.withColumn("Max_col", maxcol(f.struct([df[x] for x in df.columns[1:]])))
maxDF.show(truncate=False)

+---+----+----+----+
| ID|colA|colB|colC|
+---+----+----+----+
|ID1|   3|   5|   5|
|ID2|   4|   5|   6|
|ID3|   3|   3|   3|
+---+----+----+----+

+---+----+----+----+-------+
|ID |colA|colB|colC|Max_col|
+---+----+----+----+-------+
|ID1|3   |5   |5   |colB   |
|ID2|4   |5   |6   |colC   |
|ID3|3   |3   |3   |colA   |
+---+----+----+----+-------+

When there are ties, I want to return the names of all columns that hold the max value. How can I achieve this in PySpark? The desired output is:

+---+----+----+----+--------------+
|ID |colA|colB|colC|Max_col       |
+---+----+----+----+--------------+
|ID1|3   |5   |5   |colB,colC     |
|ID2|4   |5   |6   |colC          |
|ID3|3   |3   |3   |colA,colB,colC|
+---+----+----+----+--------------+

Thank you

Answer 1

This looks like a job for a UDF. Iterate over the value columns (pass them to the function as input), use plain Python to find the max, and check which columns share that value. Return a list (i.e. an array) of the matching column names.

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

@udf(returnType=ArrayType(StringType()))
def collect_same_max():
    ...
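One possible way to fill in that skeleton, as a minimal sketch rather than a definitive implementation. The names names_of_max and with_max_cols are hypothetical helpers introduced here, and ignoring nulls when taking the max is an assumption. The comparison logic lives in a plain Python function so it can be checked without Spark; the UDF only wraps it.

```python
def names_of_max(values, names):
    """Return every name whose value equals the maximum of `values`."""
    present = [v for v in values if v is not None]  # assumption: nulls are ignored
    if not present:
        return []
    top = max(present)
    return [n for n, v in zip(names, values) if v == top]


def with_max_cols(df, value_cols, out_col="Max_col"):
    """Add an array column listing all columns tied for the row maximum."""
    # Imported lazily so names_of_max can be tried without a Spark install.
    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, StringType

    collect_same_max = F.udf(
        lambda row: names_of_max(list(row), value_cols),
        ArrayType(StringType()),
    )
    # The struct bundles the value columns into one Row passed to the UDF.
    return df.withColumn(out_col, collect_same_max(F.struct(*value_cols)))


print(names_of_max([3, 5, 5], ["colA", "colB", "colC"]))  # ['colB', 'colC']
```

With the DataFrame from the question this would be called as with_max_cols(df, ["colA", "colB", "colC"]). If a comma-separated string is wanted instead of an array, the result can be wrapped in F.concat_ws(",", ...).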

Or, if it is doable, you can try the transform function: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.transform.html
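For a UDF-free route, here is a rough sketch that does not use transform itself. Instead it builds a Spark SQL expression from greatest, CASE WHEN and concat_ws, all of which are built-in SQL functions. The helper name max_cols_expr is hypothetical, and the sketch assumes column names contain no backticks or single quotes.

```python
def max_cols_expr(cols, sep=","):
    """Build a Spark SQL expression naming every column tied for the row maximum."""
    if len(cols) < 2:
        raise ValueError("greatest() needs at least two columns")
    quoted = [f"`{c}`" for c in cols]
    top = f"greatest({', '.join(quoted)})"
    # CASE without ELSE yields NULL for non-maximal columns; concat_ws skips NULLs.
    cases = [f"CASE WHEN {q} = {top} THEN '{c}' END" for q, c in zip(quoted, cols)]
    return f"concat_ws('{sep}', {', '.join(cases)})"


print(max_cols_expr(["colA", "colB"]))
```

The resulting string can be passed to F.expr, for example df.withColumn("Max_col", F.expr(max_cols_expr(["colA", "colB", "colC"]))), which should give a comma-separated string such as colB,colC for ID1.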

Answer 2

I would treat this as a data engineering problem and solve it with built-in functions. See the code and the logic in the comments below.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

new = (df.withColumn('x', F.array(*[F.struct(F.lit(x).alias('col'), F.col(x).alias('num')) for x in df.columns if x != 'ID']))  # Array of (column name, value) structs
       .selectExpr('ID', 'colA', 'colB', 'colC', 'inline(x)')  # Explode the array into one row per (col, num) pair
       .withColumn('z', F.max('num').over(Window.partitionBy('ID')))  # Max value for each ID
       .where(F.col('num') == F.col('z'))  # Keep only the rows that hold the max
       .groupBy('ID', 'colA', 'colB', 'colC').agg(F.collect_list('col').alias('Max_col'))  # Collect the max column names into a list
      )
new.show()

+---+----+----+----+------------------+
| ID|colA|colB|colC|           Max_col|
+---+----+----+----+------------------+
|ID1|   3|   5|   5|      [colB, colC]|
|ID2|   4|   5|   6|            [colC]|
|ID3|   3|   3|   3|[colA, colB, colC]|
+---+----+----+----+------------------+

2 answers

solution1
0 2022-03-02 09:26:36

solution2
0 2022-09-17 07:36:08
