Pyspark : Return all column names of max values
I have a DataFrame like this:
from pyspark.sql import functions as f
from pyspark.sql.types import IntegerType, StringType
#import numpy as np
data = [("ID1", 3, 5, 5), ("ID2", 4, 5, 6), ("ID3", 3, 3, 3)]
df = sqlContext.createDataFrame(data, ["ID", "colA", "colB","colC"])
df.show()
cols = df.columns
maxcol = f.udf(lambda row: cols[row.index(max(row)) +1], StringType())
maxDF = df.withColumn("Max_col", maxcol(f.struct([df[x] for x in df.columns[1:]])))
maxDF.show(truncate=False)
+---+----+----+----+
| ID|colA|colB|colC|
+---+----+----+----+
|ID1|   3|   5|   5|
|ID2|   4|   5|   6|
|ID3|   3|   3|   3|
+---+----+----+----+
+---+----+----+----+-------+
|ID |colA|colB|colC|Max_col|
+---+----+----+----+-------+
|ID1|3   |5   |5   |colB   |
|ID2|4   |5   |6   |colC   |
|ID3|3   |3   |3   |colA   |
+---+----+----+----+-------+
I want to return all column names of max values in case there are ties. How can I achieve this in pyspark, like this:
+---+----+----+----+--------------+
|ID |colA|colB|colC|Max_col       |
+---+----+----+----+--------------+
|ID1|3   |5   |5   |colB,colC     |
|ID2|4   |5   |6   |colC          |
|ID3|3   |3   |3   |colA,colB,colC|
+---+----+----+----+--------------+
Thank you
Seems like a udf solution. Iterate over the columns you have (pass them as input to the UDF), use plain Python operations to get the max, and check which columns have that same value. Return a list (aka array) of the column names.
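from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

@udf(returnType=ArrayType(StringType()))
def collect_same_max():
    ...  # get the max, then return the names of the columns that share it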
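A minimal sketch of how that UDF idea could be filled in, not necessarily what the answerer had in mind. The helper names names_of_max and add_max_cols, and the parameters id_col and out_col, are made up for illustration. It assumes the value columns hold comparable, non-null values (Python's max fails when None is mixed with numbers).

```python
def names_of_max(values, names):
    """Return every name whose value equals the maximum of `values`."""
    top = max(values)
    return [name for name, value in zip(names, values) if value == top]


def add_max_cols(df, id_col="ID", out_col="Max_col"):
    """Append an array column listing all columns tied for the row maximum."""
    # Imported lazily so names_of_max can be tried without Spark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, StringType

    value_cols = [c for c in df.columns if c != id_col]
    # The UDF receives one struct (a Row, which behaves like a tuple)
    # holding the value columns in the same order as value_cols.
    tied = F.udf(
        lambda row: names_of_max(list(row), value_cols),
        ArrayType(StringType()),
    )
    return df.withColumn(out_col, tied(F.struct(*[F.col(c) for c in value_cols])))
```

Calling add_max_cols(df).show(truncate=False) on the DataFrame from the question should give one array of names per row. If a comma-separated string is preferred over an array, wrapping the new column in F.concat_ws(",", ...) is one option.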
Or, if it is doable, you can try using the transform function from https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.transform.html
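One way the transform suggestion might look in practice, as a sketch only. It assumes Spark 3.1 or later (where pyspark.sql.functions.filter and transform accept Python lambdas), at least two value columns, and that all value columns share one type. The function name add_max_cols_builtin and its parameters are hypothetical.

```python
def add_max_cols_builtin(df, id_col="ID", out_col="Max_col"):
    """Append an array column of the column names tied for the row maximum,
    using only built-in column functions (no Python UDF)."""
    # Imported lazily so the module loads even where Spark is not installed.
    from pyspark.sql import functions as F

    value_cols = [c for c in df.columns if c != id_col]
    # Row-wise maximum across the value columns (greatest needs >= 2 columns).
    row_max = F.greatest(*[F.col(c) for c in value_cols])
    # One (name, value) struct per value column, collected into an array.
    pairs = F.array(*[
        F.struct(F.lit(c).alias("name"), F.col(c).alias("value"))
        for c in value_cols
    ])
    # Keep the pairs whose value equals the row max, then keep only the names.
    tied = F.transform(
        F.filter(pairs, lambda p: p["value"] == row_max),
        lambda p: p["name"],
    )
    return df.withColumn(out_col, tied)
```

Because everything stays inside Spark's own expressions, this avoids shipping rows to Python workers, which is usually the main reason to prefer it over a UDF.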
Data Engineering, I would say. See the code and logic below.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

new = (df.withColumn('x', F.array(*[F.struct(F.lit(x).alias('col'), F.col(x).alias('num')) for x in df.columns if x != 'ID']))  # Create an array of (column name, value) structs
    .selectExpr('ID', 'colA', 'colB', 'colC', 'inline(x)')  # Explode the struct array into one row per column
    .withColumn('z', F.first('num').over(Window.partitionBy('ID').orderBy(F.desc('num'))))  # Create a column with the max value for each ID
    .where(F.col('num') == F.col('z'))  # Isolate the max values in each ID
    .groupBy(['ID', 'colA', 'colB', 'colC']).agg(F.collect_list('col').alias('Max_col'))  # Combine the max columns into a list
)
new.show()
+---+----+----+----+------------------+
| ID|colA|colB|colC|           Max_col|
+---+----+----+----+------------------+
|ID1|   3|   5|   5|      [colB, colC]|
|ID2|   4|   5|   6|            [colC]|
|ID3|   3|   3|   3|[colA, colB, colC]|
+---+----+----+----+------------------+