
Finding the most frequent value by row among n columns in a Spark dataframe

How do we find - by row - the most frequent value among 4 columns in a Spark DataFrame (pyspark 2.2.0)? Example of df:

col1 col2 col3 col4
13   15   14   14
Null 15   15   13
Null Null Null 13
Null Null Null Null
13   13   14   14

My call:

df = df.withColumn("frq", \
        most_frequent(col("col1"),col("col2"),col("col3"),col("col4")) \
        )

and the resulting df should be

col1 col2 col3 col4  frq
13   15   14   14    14
Null 15   15   13    15
Null Null Null 13    13
Null Null Null Null  Null
13   13   14   14    13

Note that the Null values should be omitted from the calculation even if a Null is the most frequent value in a row (Null should be returned if all columns are Null, though). Tied values (last line in df) can return any of the ties.
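For reference, the example frame can be reproduced like this (a minimal sketch, assuming an active SparkSession named spark; the columns are inferred as nullable longs):

df = spark.createDataFrame(
    [
        (13, 15, 14, 14),
        (None, 15, 15, 13),
        (None, None, None, 13),
        (None, None, None, None),
        (13, 13, 14, 14),
    ],
    ["col1", "col2", "col3", "col4"],
)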

Write a udf using collections.Counter:

from collections import Counter
from pyspark.sql.functions import udf

@udf
def mode(*v):
  # Count only the non-null values passed in for this row
  counter = Counter(x for x in v if x is not None)
  if len(counter) > 0:
    # most_common(1) returns [(value, count)]; take the value of the top entry
    return counter.most_common(1)[0][0]
  else:
    # All columns were null, so return null
    return None

df.withColumn('mode', mode('col1', 'col2', 'col3', 'col4')).show()
+----+----+----+----+----+
|col1|col2|col3|col4|mode|
+----+----+----+----+----+
|  13|  15|  14|  14|  14|
|null|  15|  15|  13|  15|
|null|null|null|  13|  13|
|null|null|null|null|null|
|  13|  13|  14|  14|  13|
+----+----+----+----+----+
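One detail worth noting: a bare @udf defaults to a StringType result, so the mode column above actually comes back as strings. If a numeric column is preferred, the return type can be declared explicitly; a minimal sketch of the same logic (the name mode_int is just for illustration):

from collections import Counter
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

@udf(returnType=IntegerType())
def mode_int(*v):
  # Same logic as above, but the result column will be an integer
  counter = Counter(x for x in v if x is not None)
  if counter:
    return counter.most_common(1)[0][0]
  return None

df.withColumn('mode', mode_int('col1', 'col2', 'col3', 'col4')).show()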
