Take the sum of a pyspark dataframe per column efficiently
I work in Spark 1.6 (unfortunately). I have a dataframe with many columns whose values are 0's and 1's. I want to compute the percentage of 1's per column, so I do:
from pyspark.sql.functions import col, count, when

rowsNum = dfBinary.count()
dfStat = dfBinary.select([(count(when(col(c) == 1, c)) / rowsNum).alias(c)
                          for c in dfBinary.columns])
Is there a more efficient way to do this? Maybe a built-in function that sums per column (I did not find one, though).
You can use sum() from the functions module:
from pyspark.sql.functions import sum

dfBinary.select([(sum(c) / rowsNum).alias(c)
                 for c in dfBinary.columns]).show()
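Since the columns only contain 0's and 1's, the sum of a column divided by the row count is exactly the fraction of 1's. A minimal plain-Python sketch of that arithmetic (toy data standing in for dfBinary, no Spark required):

```python
# Toy 0/1 columns standing in for dfBinary (hypothetical data).
columns = {
    "a": [1, 0, 1, 1],
    "b": [0, 0, 1, 0],
}
rows_num = 4

# sum(c) / rowsNum per column, mirroring the Spark expression above.
fractions = {name: sum(values) / rows_num for name, values in columns.items()}
print(fractions)  # {'a': 0.75, 'b': 0.25}
```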
You can replace count and the division with mean to avoid an additional data scan:
from pyspark.sql.functions import col, mean, when

# Note: the .otherwise(0) is needed. Without it, when(col(c) == 1, c) yields
# null for the 0 rows, mean ignores nulls, and the result is always 1.0.
dfStat = dfBinary.select([
    mean(when(col(c) == 1, 1).otherwise(0)).alias(c)
    for c in dfBinary.columns])
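For a 0/1 column, the mean already equals the fraction of 1's, so plain mean(col(c)) would also work here. A quick plain-Python check of that identity on a toy column (hypothetical data):

```python
values = [1, 0, 1, 1, 0]  # a toy 0/1 column

mean_value = sum(values) / len(values)    # mean of the column
fraction = values.count(1) / len(values)  # count of 1's / row count
print(mean_value == fraction)  # True (both are 0.6)
```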
But otherwise, this is about as efficient as you can get.