
PySpark count values by condition

I have a DataFrame; here is a snippet:

[['u1', 1], ['u2', 0]]

basically a string field named f and either a 1 or a 0 for the second element (is_fav).

What I need to do is group on the first field and count the occurrences of 1s and 0s. I was hoping to do something like

num_fav = count((col("is_fav") == 1)).alias("num_fav")
num_nonfav = count((col("is_fav") == 0)).alias("num_nonfav")

df.groupBy("f").agg(num_fav, num_nonfav)

It does not work properly: in both cases I get the same result, which amounts to the count of the items in the group, so the filter (whether it is a 1 or a 0) seems to be ignored. Does this depend on how count works?

There is no filter here. Both col("is_fav") == 1 and col("is_fav") == 0 are just boolean expressions, and count doesn't really care about their value as long as it is defined.
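To see this concretely, here is a minimal sketch (assuming a running SparkSession named spark and the two sample rows above). count counts every row whose argument is non-null, and the boolean comparisons are non-null for every row here, so both aggregates come out equal to the plain group size:

from pyspark.sql.functions import col, count

df = spark.createDataFrame([["u1", 1], ["u2", 0]], ["f", "is_fav"])
# Both aggregates equal the group size, because (is_fav == 1) and
# (is_fav == 0) are defined (non-null) on every row of this data.
df.groupBy("f").agg(
    count(col("is_fav") == 1).alias("num_fav"),
    count(col("is_fav") == 0).alias("num_nonfav")
).show()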

There are many ways you can solve this, for example by using a simple sum:

from pyspark.sql.functions import count, sum

gpd = df.groupBy("f")
gpd.agg(
    # summing a 0/1 column counts the 1s
    sum("is_fav").alias("fv"),
    # total rows minus the 1s gives the 0s
    (count("is_fav") - sum("is_fav")).alias("nfv")
)
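With the two sample rows above, appending .show() to this aggregation prints one row per group (a sketch; row order from show is not guaranteed):

# f='u1': fv=1, nfv=0
# f='u2': fv=0, nfv=1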

or by making the ignored values undefined (a.k.a. NULL):

from pyspark.sql.functions import col, count, when

exprs = [
    # when(...) without otherwise() returns NULL for non-matching rows,
    # and count skips NULLs, so only matching rows are counted
    count(when(col("is_fav") == x, True)).alias(c)
    for (x, c) in [(1, "fv"), (0, "nfv")]
]
gpd.agg(*exprs)
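This produces the same result as the sum-based version, because the rows turned into NULL simply drop out of each count. A quick check with the sample data (same SparkSession assumed):

gpd.agg(*exprs).show()
# f='u1': fv=1, nfv=0
# f='u2': fv=0, nfv=1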
