
Finding the most frequent value by row among n columns in a Spark dataframe

How do we find - by row - the most frequent value among 4 columns in a Spark DataFrame (pyspark 2.2.0)? Example of df:

col1 col2 col3 col4
13   15   14   14
Null 15   15   13
Null Null Null 13
Null Null Null Null
13   13   14   14

My call:

df = df.withColumn("frq",
        most_frequent(col("col1"), col("col2"), col("col3"), col("col4")))

and the resulting df should be

col1 col2 col3 col4  frq
13   15   14   14    14
Null 15   15   13    15
Null Null Null 13    13
Null Null Null Null  Null
13   13   14   14    13

Note that Null values should be omitted from the calculation, even when Null is the most frequent value in a row (Null should be returned only if all columns are Null). Tied values (last row of df) can resolve to any of the tied values.

Write a UDF using collections.Counter:

from collections import Counter
from pyspark.sql.functions import udf

@udf  # no returnType given, so the default StringType is used; pass an explicit type for numeric output
def mode(*v):
    # count only the non-null values in the row
    counter = Counter(x for x in v if x is not None)
    if len(counter) > 0:
        # most_common(1) returns [(value, count)]; ties resolve to one of the tied values
        return counter.most_common(1)[0][0]
    else:
        return None

df.withColumn('mode', mode('col1', 'col2', 'col3', 'col4')).show()
+----+----+----+----+----+
|col1|col2|col3|col4|mode|
+----+----+----+----+----+
|  13|  15|  14|  14|  14|
|null|  15|  15|  13|  15|
|null|null|null|  13|  13|
|null|null|null|null|null|
|  13|  13|  14|  14|  13|
+----+----+----+----+----+
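The None filter and tie behavior of the UDF body can be sanity-checked locally without a Spark session. This is a sketch that mirrors the same logic as a plain function (`row_mode` is a hypothetical name, not part of the answer above):

```python
from collections import Counter

def row_mode(*values):
    """Most frequent non-None value among the arguments; None if all are None."""
    counter = Counter(v for v in values if v is not None)
    if counter:
        # most_common(1) -> [(value, count)]; on a tie, which value wins
        # depends on Counter internals, matching the "any of the ties" requirement
        return counter.most_common(1)[0][0]
    return None

print(row_mode(13, 15, 14, 14))          # 14
print(row_mode(None, 15, 15, 13))        # 15
print(row_mode(None, None, None, None))  # None
print(row_mode(13, 13, 14, 14))          # 13 or 14 (tie)
```

Once the function behaves as expected, wrapping it with `udf` (as in the answer above) gives the Spark version.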

