
Take the sum of a pyspark dataframe per column efficiently

I work with Spark 1.6 (unfortunately). I have a dataframe with many columns whose values are 0s and 1s, and I want to compute the percentage of 1s per column. So I do:

from pyspark.sql.functions import count, when, col

rowsNum = dfBinary.count()
dfStat = dfBinary.select([(count(when(col(c) == 1, c)) / rowsNum).alias(c)
                          for c in dfBinary.columns])
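For concreteness, here is a small self-contained run of the snippet above on toy data (the local SparkContext/SQLContext setup, the column names a and b, and the sample values are illustrative, not part of the question):

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import count, when, col

# Illustrative local setup; in the pyspark shell, sc and sqlContext already exist.
sc = SparkContext("local[2]", "binary-fractions")
sqlContext = SQLContext(sc)

# Toy 0/1 dataframe: column a has three 1s out of four rows, column b has two.
dfBinary = sqlContext.createDataFrame([(1, 0), (1, 1), (0, 1), (1, 0)], ["a", "b"])

rowsNum = dfBinary.count()
dfStat = dfBinary.select([(count(when(col(c) == 1, c)) / rowsNum).alias(c)
                          for c in dfBinary.columns])
dfStat.show()  # expected: a = 0.75, b = 0.5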

Is there a more efficient way to do this? Maybe a built-in function that sums per column (I did not find any, though)?

You can use sum() from the functions module:

from pyspark.sql.functions import sum
dfBinary.select([(sum(c)/rowsNum).alias(c) for c in dfBinary.columns]).show()
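One caveat (not from the answer itself): rowsNum = dfBinary.count() is still a separate job over the data. Below is a sketch of folding the row count into the same aggregation so everything is computed in one scan; the _n alias is an arbitrary name assumed not to collide with an existing column:

from pyspark.sql.functions import sum, count, lit

aggs = [sum(c).alias(c) for c in dfBinary.columns] + [count(lit(1)).alias("_n")]
totals = dfBinary.select(aggs).first().asDict()
ratios = {c: totals[c] / float(totals["_n"]) for c in dfBinary.columns}
# e.g. {'a': 0.75, 'b': 0.5} for the toy dataframe above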

You can replace count and the division with mean to avoid an additional scan of the data:

from pyspark.sql.functions import mean, when, col

dfStat = dfBinary.select([
    # .otherwise(0) keeps the zero rows in the average; without it, when()
    # yields null for them and mean() would ignore those rows entirely.
    mean(when(col(c) == 1, 1).otherwise(0)).alias(c)
    for c in dfBinary.columns])

but otherwise, it is as efficient as you can get.
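As a quick sanity check on the toy dfBinary from above (sample values are illustrative): when a column holds only 0s and 1s with no nulls, the expression reduces to the plain column mean.

from pyspark.sql.functions import mean, col

dfBinary.select([mean(col(c)).alias(c) for c in dfBinary.columns]).show()
# expected for the toy data: a = 0.75, b = 0.5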
