
Pyspark : Return all column names of max values

I have a DataFrame like this:

from pyspark.sql import SparkSession
from pyspark.sql import functions as f
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

data = [("ID1", 3, 5, 5), ("ID2", 4, 5, 6), ("ID3", 3, 3, 3)]
df = spark.createDataFrame(data, ["ID", "colA", "colB", "colC"])
df.show()

cols = df.columns
# the struct below holds only the value columns, so +1 skips "ID" in cols
maxcol = f.udf(lambda row: cols[row.index(max(row)) + 1], StringType())

maxDF = df.withColumn("Max_col", maxcol(f.struct(*[df[x] for x in df.columns[1:]])))
maxDF.show(truncate=False)

+---+----+----+----+
| ID|colA|colB|colC|
+---+----+----+----+
|ID1|   3|   5|   5|
|ID2|   4|   5|   6|
|ID3|   3|   3|   3|
+---+----+----+----+

+---+----+----+----+-------+
|ID |colA|colB|colC|Max_col|
+---+----+----+----+-------+
|ID1|3   |5   |5   |colB   |
|ID2|4   |5   |6   |colC   |
|ID3|3   |3   |3   |colA   |
+---+----+----+----+-------+

I want to return all column names of the max values in case there are ties. How can I achieve this in PySpark, like this:

+---+----+----+----+--------------+
|ID |colA|colB|colC|Max_col       |
+---+----+----+----+--------------+
|ID1|3   |5   |5   |colB,colC     |
|ID2|4   |5   |6   |colC          |
|ID3|3   |3   |3   |colA,colB,colC|
+---+----+----+----+--------------+

Thank you

Seems like a udf solution. Iterate over the columns you have (pass them as an input to the udf), perform Python operations to get the max, and check which columns share that value. Return a list (aka array) of the column names.

@udf(returnType=ArrayType(StringType()))
def collect_same_max():
    ...
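
For instance, a minimal body for that udf (a sketch, assuming the value columns are passed in as a single struct, as in the question's maxcol example):

from pyspark.sql.functions import struct, udf
from pyspark.sql.types import ArrayType, StringType

@udf(returnType=ArrayType(StringType()))
def collect_same_max(row):
    d = row.asDict()     # column name -> value for this row
    m = max(d.values())  # the row maximum
    return [k for k, v in d.items() if v == m]  # every name tied for the max

result = df.withColumn("Max_col", collect_same_max(struct(*df.columns[1:])))
result.show(truncate=False)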

Or, maybe, if it is doable, you can try to use the transform function from https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.transform.html
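
A sketch of that route, using the higher-order functions filter and transform (this assumes Spark 3.1+, where both are exposed in the Python API; pairs and row_max are just illustrative names):

from pyspark.sql import functions as F

value_cols = ["colA", "colB", "colC"]
# array of (name, value) structs, one entry per value column
pairs = F.array(*[F.struct(F.lit(c).alias("name"), F.col(c).alias("val")) for c in value_cols])
row_max = F.greatest(*value_cols)  # per-row maximum across the value columns

result = df.withColumn(
    "Max_col",
    F.transform(                                         # struct -> its column name
        F.filter(pairs, lambda s: s["val"] == row_max),  # keep only the tied entries
        lambda s: s["name"],
    ),
)
result.show(truncate=False)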

Data Engineering, I would say. See the code and logic below:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

new = (df.withColumn('x', F.array(*[F.struct(F.lit(x).alias('col'), F.col(x).alias('num'))
                                    for x in df.columns if x != 'ID']))  # array of (name, value) structs
       .selectExpr('ID', 'colA', 'colB', 'colC', 'inline(x)')  # explode the struct array into rows
       .withColumn('z', F.first('num').over(Window.partitionBy('ID').orderBy(F.desc('num'))))  # max value per ID
       .where(F.col('num') == F.col('z'))  # keep only the max rows within each ID
       .groupBy(['ID', 'colA', 'colB', 'colC']).agg(F.collect_list('col').alias('Max_col'))  # collect tied names
      )
new.show()

+---+----+----+----+------------------+
| ID|colA|colB|colC|           Max_col|
+---+----+----+----+------------------+
|ID1|   3|   5|   5|      [colB, colC]|
|ID2|   4|   5|   6|            [colC]|
|ID3|   3|   3|   3|[colA, colB, colC]|
+---+----+----+----+------------------+
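
If the comma-separated string from the desired output is preferred over an array, the collected list can be flattened with array_join (available since Spark 2.4):

new = new.withColumn('Max_col', F.array_join('Max_col', ','))  # ["colB","colC"] -> "colB,colC"
new.show(truncate=False)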
