
Take the sum of a pyspark dataframe per column efficiently

I work with Spark 1.6 (unfortunately). I have a dataframe with many columns whose values are 0s and 1s, and I want to compute the percentage of 1s per column. So I do:

from pyspark.sql.functions import count, when, col

rowsNum = dfBinary.count()
dfStat = dfBinary.select([(count(when(col(c) == 1, c)) / rowsNum).alias(c)
                          for c in dfBinary.columns])
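For concreteness, here is a small self-contained run of the snippet above on toy data (the local SparkContext/SQLContext setup, the column names a and b, and the sample values are illustrative, not part of the question):

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import count, when, col

# Illustrative local setup; in the pyspark shell, sc and sqlContext already exist.
sc = SparkContext("local[2]", "binary-fractions")
sqlContext = SQLContext(sc)

# Toy 0/1 dataframe: column a has three 1s out of four rows, column b has two.
dfBinary = sqlContext.createDataFrame([(1, 0), (1, 1), (0, 1), (1, 0)], ["a", "b"])

rowsNum = dfBinary.count()
dfStat = dfBinary.select([(count(when(col(c) == 1, c)) / rowsNum).alias(c)
                          for c in dfBinary.columns])
dfStat.show()  # expected: a = 0.75, b = 0.5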

Is there a more efficient way to do this? Maybe a built-in function that sums per column (I did not find any, though)?

You can use sum() from the functions module:

from pyspark.sql.functions import sum
dfBinary.select([(sum(c)/rowsNum).alias(c) for c in dfBinary.columns]).show()
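One caveat (not from the answer itself): rowsNum = dfBinary.count() is still a separate job over the data. Below is a sketch of folding the row count into the same aggregation so everything is computed in one scan; the _n alias is an arbitrary name assumed not to collide with an existing column:

from pyspark.sql.functions import sum, count, lit

aggs = [sum(c).alias(c) for c in dfBinary.columns] + [count(lit(1)).alias("_n")]
totals = dfBinary.select(aggs).first().asDict()
ratios = {c: totals[c] / float(totals["_n"]) for c in dfBinary.columns}
# e.g. {'a': 0.75, 'b': 0.5} for the toy dataframe above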

You can replace count and the division with mean to avoid an additional scan of the data:

from pyspark.sql.functions import mean, when, col

dfStat = dfBinary.select([
    # .otherwise(0) keeps the zero rows in the average; without it, when()
    # yields null for them and mean() would ignore those rows entirely.
    mean(when(col(c) == 1, 1).otherwise(0)).alias(c)
    for c in dfBinary.columns])

but otherwise, it is as efficient as you can get.
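As a quick sanity check on the toy dfBinary from above (sample values are illustrative): when a column holds only 0s and 1s with no nulls, the expression reduces to the plain column mean.

from pyspark.sql.functions import mean, col

dfBinary.select([mean(col(c)).alias(c) for c in dfBinary.columns]).show()
# expected for the toy data: a = 0.75, b = 0.5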
