I have the following data:
>>> dfStd1.show()
+---+----+------+-------+-----------------------------------------------+------+
| id|Name|Seq_Id|Carrier|CASE WHEN (NOT (Seq_Id = 1)) THEN 0 ELSE 12 END|string|
+---+----+------+-------+-----------------------------------------------+------+
| 0| 0| 0| 2| 0| 0|
+---+----+------+-------+-----------------------------------------------+------+
So, here I need the names of the columns which have a value greater than 0 — for example, Carrier here — and I need to store such column names in a list. I tried the code below, but it doesn't work; I also referred to many SO links, but no luck:
>>> dfStd1[(dfStd1 > 0).any(axis=1)]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: '>' not supported between instances of 'DataFrame' and 'int'
It throws the error above. I even tried converting the DataFrame to pandas and then filtering, but with no result.
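For context, `dfStd1` is a Spark DataFrame, which does not support elementwise `>` comparison against an int the way pandas does — hence the TypeError. If converting to pandas is acceptable, the per-column check can look like this minimal sketch (the sample row from the question is rebuilt directly in pandas for illustration; with a real Spark DataFrame you would start from `dfStd1.toPandas()`):

```python
import pandas as pd

# The sample row from the question, rebuilt as a pandas DataFrame;
# with a real Spark DataFrame you would use pdf = dfStd1.toPandas().
pdf = pd.DataFrame({
    "id": [0], "Name": [0], "Seq_Id": [0], "Carrier": [2],
    "CASE WHEN (NOT (Seq_Id = 1)) THEN 0 ELSE 12 END": [0], "string": [0],
})

# Column names where any value is greater than 0 (checked column by column).
gt_zero_cols = [c for c in pdf.columns if (pdf[c] > 0).any()]
print(gt_zero_cols)
```

Note that `(pdf > 0).any(axis=1)` from the attempt above would give a row mask, not column names; the comprehension checks each column instead.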
First, you need the columns which are numeric:
# df.dtypes returns a list of (column name, type name) pairs
schema = {col: col_type for col, col_type in df.dtypes}
numeric_cols = [
    col
    for col, col_type in schema.items()
    if col_type in "int double bigint".split()
]
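To illustrate that filtering in plain Python, here is what it does to the `(name, type)` pairs that `df.dtypes` would return for the sample data (the type names below are an assumption based on the shown schema):

```python
# Hypothetical output of df.dtypes for the sample DataFrame (assumed types).
dtypes = [
    ("id", "bigint"), ("Name", "bigint"), ("Seq_Id", "bigint"),
    ("Carrier", "bigint"),
    ("CASE WHEN (NOT (Seq_Id = 1)) THEN 0 ELSE 12 END", "bigint"),
    ("string", "string"),
]

# Keep only the columns whose type is one of the numeric type names.
schema = {col: col_type for col, col_type in dtypes}
numeric_cols = [
    col
    for col, col_type in schema.items()
    if col_type in "int double bigint".split()
]
print(numeric_cols)
```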
Then you can count the number of elements in each of those columns that are greater than 0:
import json
from pyspark.sql.functions import count, when, col

# For each numeric column, count the values greater than 0, then
# collect the single result row as a dict via its JSON representation.
count_cols_gt_zero = [
    json.loads(x)
    for x in df.select(
        [count(when(col(c) > 0, c)).alias(c) for c in numeric_cols]
    )
    .toJSON()
    .collect()
][0]
Then, finally, keep the column names whose count is greater than 0:
final = [x for x, y in count_cols_gt_zero.items() if y > 0]
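To make the last two steps concrete: the collected row becomes a plain dict of per-column counts, which the final comprehension filters. A sketch with illustrative counts matching the sample row:

```python
# Illustrative result of the select/collect step: per-column counts of
# values greater than 0 (assumed values, matching the sample row).
count_cols_gt_zero = {
    "id": 0, "Name": 0, "Seq_Id": 0, "Carrier": 1,
    "CASE WHEN (NOT (Seq_Id = 1)) THEN 0 ELSE 12 END": 0,
}

# Keep only the column names whose count of positive values is non-zero.
final = [x for x, y in count_cols_gt_zero.items() if y > 0]
print(final)  # ['Carrier']
```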