
PySpark: Replace values of dataframe based on criteria

I have a dataframe as shown below:

+------+------+------+
| colA | colB | colC |
+------+------+------+
|  123 |    3 |    0 |
|  222 |    0 |    1 |
|  200 |    0 |    2 |
+------+------+------+

I want to replace the values in colB and colC with a value of 1 if they are greater than 0.

I am able to use the na.fill function when I need to fill nulls with 0, but I am not sure how to do this kind of conditional replacement.

Assuming your dataframe is df, you can do the following:

from pyspark.sql.functions import when

df = df.select('colA',
               when(df.colB > 0, 1).otherwise(df.colB).alias('colB'),
               when(df.colC > 0, 1).otherwise(df.colC).alias('colC'))

This checks whether colB and colC are greater than 0 and assigns 1 where they are; the otherwise clause keeps the original value (0) so those rows do not become null.
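If you prefer not to list every column in a select, a withColumn-based variant works as well. This is a minimal sketch assuming the same df and column names:

from pyspark.sql.functions import col, when

# Overwrite colB and colC in place, leaving all other columns untouched;
# otherwise() preserves the existing value when it is not greater than 0.
df = df.withColumn('colB', when(col('colB') > 0, 1).otherwise(col('colB'))) \
       .withColumn('colC', when(col('colC') > 0, 1).otherwise(col('colC')))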
