How do we find, row by row, the most frequent value among 4 columns in a Spark DataFrame (pyspark 2.2.0)? Example of df:
col1 col2 col3 col4
13   15   14   14
Null 15   15   13
Null Null Null 13
Null Null Null Null
13   13   14   14
My call:
df = df.withColumn(
    "frq",
    most_frequent(col("col1"), col("col2"), col("col3"), col("col4"))
)
and the resulting df should be:
col1 col2 col3 col4 frq
13   15   14   14   14
Null 15   15   13   15
Null Null Null 13   13
Null Null Null Null Null
13   13   14   14   13
Note that Null values should be omitted from the calculation even if Null is the most frequent value in a row (though Null should be returned if all columns are Null). For tied values (last row of df), any of the tied values may be returned.
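For reference, the sample df above can be reproduced with something like the following (a minimal sketch, assuming the columns are nullable integers and Null is represented as Python None):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType

spark = SparkSession.builder.getOrCreate()

# Nullable integer columns matching the example above.
schema = StructType([StructField(c, IntegerType(), True)
                     for c in ["col1", "col2", "col3", "col4"]])

df = spark.createDataFrame(
    [(13, 15, 14, 14),
     (None, 15, 15, 13),
     (None, None, None, 13),
     (None, None, None, None),
     (13, 13, 14, 14)],
    schema,
)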
Write a udf using collections.Counter:
from collections import Counter
from pyspark.sql.functions import udf

@udf
def mode(*v):
    # Count only the non-null values in the row.
    counter = Counter(x for x in v if x is not None)
    if len(counter) > 0:
        # most_common(1) returns [(value, count)]; ties are resolved arbitrarily.
        return counter.most_common(1)[0][0]
    else:
        # All columns were null.
        return None

df.withColumn('mode', mode('col1', 'col2', 'col3', 'col4')).show()
+----+----+----+----+----+
|col1|col2|col3|col4|mode|
+----+----+----+----+----+
| 13| 15| 14| 14| 14|
|null| 15| 15| 13| 15|
|null|null|null| 13| 13|
|null|null|null|null|null|
| 13| 13| 14| 14| 13|
+----+----+----+----+----+
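Note that the bare @udf decorator defaults to a StringType result, so the mode column comes back as strings. If you would rather keep it numeric, a small variation (assuming the input columns are integers, and that your PySpark version accepts a return type in the decorator form, which 2.2+ should) is to declare the return type explicitly:

from collections import Counter
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

@udf(returnType=IntegerType())
def mode_int(*v):
    # Same logic as above, but the declared IntegerType keeps the column numeric.
    counter = Counter(x for x in v if x is not None)
    return counter.most_common(1)[0][0] if counter else None

df.withColumn('mode', mode_int('col1', 'col2', 'col3', 'col4')).show()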