I want to apply a condition over subsets of my data. In the example, I want to use F.when over "A" and "B" from col1 separately, and return a DataFrame that contains both "A" and "B" with the condition applied.
I have tried using a group by for this, but I'm not interested in aggregating the data: I want to return the same number of rows before and after the condition is applied.
import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("test").getOrCreate()
df = spark.createDataFrame(pd.DataFrame({"col1": ["A", "A", "A", "B", "B"], "score": [1, 2, 3, 1, 2]}))
condition = F.when(F.col("score") > 2, 1).otherwise(0)
Does anyone have any advice as to how to solve this problem? Below is my expected output, but it is crucial that the condition is applied over "A" and "B" separately, as my actual use case is a bit different than the toy example supplied.
Try with:
df.select(df.col1, df.score, condition.alias("send")).show()
# Out:
# +----+-----+----+
# |col1|score|send|
# +----+-----+----+
# | A| 1| 0|
# | A| 2| 0|
# | A| 3| 1|
# | B| 1| 0|
# | B| 2| 0|
# +----+-----+----+
(see: pyspark.sql.Column.when)
To apply different conditions depending on the row values, use:
import pyspark.sql.functions as F

df.withColumn("send", F.when((F.col("col1") == "A") & (F.col("score") > 2), 1)
                       .when((F.col("col1") == "B") & (F.col("score") > 1), 1)
                       .otherwise(0)
).show()
# Out:
# +----+-----+----+
# |col1|score|send|
# +----+-----+----+
# | A| 1| 0|
# | A| 2| 0|
# | A| 3| 1|
# | B| 1| 0|
# | B| 2| 1|
# +----+-----+----+