Is there a way to count non-null values per row in a spark df?

I have a very wide df with a large number of columns. I need to get the count of non-null values per row for this in python.

Example DF -

+-----+----------+-----+-----+-----+-----+-----+-----+
| name|      date|col01|col02|col03|col04|col05|col06|
+-----+----------+-----+-----+-----+-----+-----+-----+
|name1|2017-12-01|100.0|255.5|333.3| null|125.2|132.7|
|name2|2017-12-01|101.1|105.5| null| null|127.5| null|
+-----+----------+-----+-----+-----+-----+-----+-----+

I want to add a column with a count of non-null values in col01-col06 -

+-----+----------+-----+-----+-----+-----+-----+-----+-----+
| name|      date|col01|col02|col03|col04|col05|col06|count|
+-----+----------+-----+-----+-----+-----+-----+-----+-----+
|name1|2017-12-01|100.0|255.5|333.3| null|125.2|132.7|    5|
|name2|2017-12-01|101.1|105.5| null| null|127.5| null|    3|
+-----+----------+-----+-----+-----+-----+-----+-----+-----+

I was able to get this in a pandas df like this -

# notnull() marks non-null cells True; sum(axis=1) counts them per row
df['count'] = df.loc[:, 'col01':'col06'].notnull().sum(axis=1)

But no luck with spark df so far :( Any ideas?

Convert the null values to true/false, then to integers, then sum them:

from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

df = spark.createDataFrame([[1, None, None, 0], 
                            [2, 3, 4, None], 
                            [None, None, None, None], 
                            [1, 5, 7, 2]], 'a: int, b: int, c: int, d: int')

# isnull() marks null cells true; casting to IntegerType turns true/false
# into 1/0, and Python's built-in sum adds the Columns element-wise per row
df.select(sum([F.isnull(df[col]).cast(IntegerType()) for col in df.columns]).alias('null_count')).show()

Output:

+----------+
|null_count|
+----------+
|         2|
|         1|
|         4|
|         0|
+----------+
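Note that this counts nulls per row, while the question asks for the non-null count appended as a new column. A minimal sketch of the inverted version, assuming the question's df and its col01-col06 column names:

from pyspark.sql import functions as F

# Column names assumed from the question's example DF
value_cols = ['col01', 'col02', 'col03', 'col04', 'col05', 'col06']

# isNotNull() gives true/false per cell; casting to int yields 1/0,
# and Python's built-in sum adds the Columns element-wise per row
df = df.withColumn('count', sum(F.col(c).isNotNull().cast('int') for c in value_cols))
df.show()

Listing the value columns explicitly keeps identifier columns like name and date out of the count.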
