
Finding the most frequent value by row among n columns in a Spark dataframe

How do we find - by row - the most frequent value among 4 columns in a Spark DataFrame (pyspark 2.2.0)? Example of df:

col1 col2 col3 col4
13   15   14   14
Null 15   15   13
Null Null Null 13
Null Null Null Null
13   13   14   14

My call:

df = df.withColumn("frq",
        most_frequent(col("col1"), col("col2"), col("col3"), col("col4")))

and the resulting df should be

col1 col2 col3 col4  frq
13   15   14   14    14
Null 15   15   13    15
Null Null Null 13    13
Null Null Null Null  Null
13   13   14   14    13

Note that Null values should be omitted from the calculation, even when Null is the most frequent value in a row (Null should be returned only if all columns are Null). Tied values (last row of df) can resolve to any of the tied values.

Write a UDF using collections.Counter:

from collections import Counter
from pyspark.sql.functions import udf

@udf  # no returnType given, so the default StringType is used; pass an explicit type for numeric output
def mode(*v):
    # count only the non-null values in the row
    counter = Counter(x for x in v if x is not None)
    if len(counter) > 0:
        # most_common(1) returns [(value, count)]; ties resolve to one of the tied values
        return counter.most_common(1)[0][0]
    else:
        return None

df.withColumn('mode', mode('col1', 'col2', 'col3', 'col4')).show()
+----+----+----+----+----+
|col1|col2|col3|col4|mode|
+----+----+----+----+----+
|  13|  15|  14|  14|  14|
|null|  15|  15|  13|  15|
|null|null|null|  13|  13|
|null|null|null|null|null|
|  13|  13|  14|  14|  13|
+----+----+----+----+----+
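The None filter and tie behavior of the UDF body can be sanity-checked locally without a Spark session. This is a sketch that mirrors the same logic as a plain function (`row_mode` is a hypothetical name, not part of the answer above):

```python
from collections import Counter

def row_mode(*values):
    """Most frequent non-None value among the arguments; None if all are None."""
    counter = Counter(v for v in values if v is not None)
    if counter:
        # most_common(1) -> [(value, count)]; on a tie, which value wins
        # depends on Counter internals, matching the "any of the ties" requirement
        return counter.most_common(1)[0][0]
    return None

print(row_mode(13, 15, 14, 14))          # 14
print(row_mode(None, 15, 15, 13))        # 15
print(row_mode(None, None, None, None))  # None
print(row_mode(13, 13, 14, 14))          # 13 or 14 (tie)
```

Once the function behaves as expected, wrapping it with `udf` (as in the answer above) gives the Spark version.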

