
How to use DataFrame.withColumn with a condition

I want to know how to create a new column on the full DataFrame whose values are based on only a subset of the DataFrame (i.e. some function applied based on a condition).

In this example, I want to create a column that associates each ID with the sum of only the positive values in the Value column for that ID.

Using this as the dataset:

+-----+-----+
|  ID |Value|
+-----+-----+
|  1  |  -4 |
|  2  |  5  |
|  2  | -23 |
|  1  |  5  |
|  2  | 19  |
|  1  | 12  |
+-----+-----+

I'm looking to get the following result:

+-----+-----+-------------+
|  ID |Value| sum_pos_val |
+-----+-----+-------------+
|  1  |  -4 |     17      |
|  2  |  5  |     24      |
|  2  | -23 |     24      |
|  1  |  5  |     17      |
|  2  | 19  |     24      |
|  1  | 12  |     17      |
+-----+-----+-------------+

The following code sums the Value column by ID, but how can I filter so that only the positive values are summed?

import pyspark.sql.functions as f
from pyspark.sql import Window

w = Window.partitionBy('ID')
df.withColumn('sum_pos_val', f.sum('Value').over(w))\
    .show()
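
For reference, the sample DataFrame above can be built like this (a minimal sketch, assuming an existing SparkSession named spark):

# construct the sample data from the table above
# (assumes a SparkSession is already available as `spark`)
df = spark.createDataFrame(
    [(1, -4), (2, 5), (2, -23), (1, 5), (2, 19), (1, 12)],
    ['ID', 'Value'],
)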

Replace

f.sum('Value').over(w)

with

f.sum(
    f.when(f.col('Value') > 0, f.col('Value')).otherwise(0)
).over(w)

or, even simpler, since sum ignores nulls (see spark dataframe groupping does not count nulls)

f.sum(
    f.when(f.col('Value') > 0, f.col('Value'))
).over(w)
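
Putting it together, a minimal end-to-end sketch (assuming the sample DataFrame above is available as df):

import pyspark.sql.functions as f
from pyspark.sql import Window

# sum only the positive values per ID; rows with Value <= 0 contribute
# null, which sum ignores
w = Window.partitionBy('ID')
df.withColumn(
    'sum_pos_val',
    f.sum(f.when(f.col('Value') > 0, f.col('Value'))).over(w),
).show()

This produces the expected result: 17 for ID 1 (5 + 12) and 24 for ID 2 (5 + 19), repeated on every row of each ID.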
