How to add a new column to a PySpark dataframe that counts the column values greater than 0?
I want to add a new column to a PySpark dataframe which contains, for each row, the count of all column values that are greater than 0.
Here is my demo dataframe.
+-----------+----+----+----+----+----+----+
|customer_id|2010|2011|2012|2013|2014|2015|
+-----------+----+----+----+----+----+----+
| 1 | 0 | 4 | 0 | 32 | 0 | 87 |
| 2 | 5 | 5 | 56 | 23 | 0 | 09 |
| 3 | 6 | 6 | 87 | 0 | 45 | 23 |
| 4 | 7 | 0 | 12 | 89 | 78 | 0 |
| 6 | 0 | 0 | 0 | 23 | 45 | 64 |
+-----------+----+----+----+----+----+----+
The dataframe above records a customer's visits per year. I want to count how many years each customer visited, so I need a visit_count column holding the number of year columns (2010, 2011, 2012, 2013, 2014, 2015) with a value greater than 0.
+-----------+----+----+----+----+----+----+-----------+
|customer_id|2010|2011|2012|2013|2014|2015|visit_count|
+-----------+----+----+----+----+----+----+-----------+
| 1 | 0 | 4 | 0 | 32 | 0 | 87 | 3 |
| 2 | 5 | 5 | 56 | 23 | 0 | 09 | 5 |
| 3 | 6 | 6 | 87 | 0 | 45 | 23 | 5 |
| 4 | 7 | 0 | 12 | 89 | 78 | 0 | 4 |
| 6 | 0 | 0 | 0 | 23 | 45 | 64 | 3 |
+-----------+----+----+----+----+----+----+-----------+
How can I achieve this?
Try this:
df.withColumn('visit_count', sum((df[c] > 0).cast('integer') for c in df.columns if c != 'customer_id'))
Note that customer_id must be excluded from the columns being summed, otherwise it is counted as a "visit" too (it is greater than 0 for every row).