Finding the most frequent value by row among n columns in a Spark dataframe
How do we find - by row - the most frequent value among 4 columns in a Spark DataFrame (pyspark 2.2.0)? Example of df:
col1 col2 col3 col4
13   15   14   14
Null 15   15   13
Null Null Null 13
Null Null Null Null
13   13   14   14
My call:
df = df.withColumn("frq", \
most_frequent(col("col1"),col("col2"),col("col3"),col("col4")) \
)
and the resulting df should be:
col1 col2 col3 col4 frq
13   15   14   14   14
Null 15   15   13   15
Null Null Null 13   13
Null Null Null Null Null
13   13   14   14   13
Note that the Null values should be omitted from the calculation, even if Null is the most frequent value in a row (Null should be returned if all columns are Null, though). Tied values (last line in df) can resolve to any of the tied values.
Write a udf using collections.Counter:
from collections import Counter
from pyspark.sql.functions import udf

@udf
def mode(*v):
    # Count only the non-null values in the row
    counter = Counter(x for x in v if x is not None)
    if len(counter) > 0:
        return counter.most_common(1)[0][0]
    else:
        # All columns were null
        return None
df.withColumn('mode', mode('col1', 'col2', 'col3', 'col4')).show()
+----+----+----+----+----+
|col1|col2|col3|col4|mode|
+----+----+----+----+----+
| 13| 15| 14| 14| 14|
|null| 15| 15| 13| 15|
|null|null|null| 13| 13|
|null|null|null|null|null|
| 13| 13| 14| 14| 13|
+----+----+----+----+----+
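The row-wise logic inside the udf is plain Python, so it can be checked without a Spark session. Below is a sketch of the same `Counter`-based function (minus the `@udf` decorator, which only wraps it for Spark) exercised against the rows from the example df:

```python
from collections import Counter

def mode(*v):
    """Return the most frequent non-None value, or None if all values are None."""
    counter = Counter(x for x in v if x is not None)
    if counter:
        # most_common(1) returns a list like [(value, count)]
        return counter.most_common(1)[0][0]
    return None

print(mode(13, 15, 14, 14))            # 14
print(mode(None, 15, 15, 13))          # 15
print(mode(None, None, None, 13))      # 13
print(mode(None, None, None, None))    # None
print(mode(13, 13, 14, 14))            # a tie: returns either 13 or 14
```

Note that for ties, `Counter.most_common` makes no ordering guarantee between equal counts across Python versions, which matches the question's allowance that either tied value may be returned.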