I want to apply a condition over subsets of my data. In the example, I want to use F.when over "A" and "B" from col1 separately, and return a DataFrame that contains both "A" and "B" with the condition applied.
I have tried using a group by for this, but I'm not interested in aggregating the data: I want to return the same number of rows before and after the condition is applied.
import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("test").getOrCreate()
df = spark.createDataFrame(pd.DataFrame({"col1": ["A", "A", "A", "B", "B"], "score": [1, 2, 3, 1, 2]}))
condition = F.when(F.col("score") > 2, 1).otherwise(0)
Does anyone have any advice as to how to solve this problem? Below is my expected output, but it is crucial that the condition is applied over "A" and "B" separately, as my actual use case is a bit different than the toy example supplied.
Try with:
df.select(df.col1, df.score, condition.alias("send")).show()
# Out:
# +----+-----+----+
# |col1|score|send|
# +----+-----+----+
# | A| 1| 0|
# | A| 2| 0|
# | A| 3| 1|
# | B| 1| 0|
# | B| 2| 0|
# +----+-----+----+
(see: pyspark.sql.Column.when)
To apply different conditions depending on the row values, use:
import pyspark.sql.functions as F

df.withColumn("send", F.when((F.col("col1") == "A") & (F.col("score") > 2), 1)
                       .when((F.col("col1") == "B") & (F.col("score") > 1), 1)
                       .otherwise(0)
).show()
# Out:
# +----+-----+----+
# |col1|score|send|
# +----+-----+----+
# | A| 1| 0|
# | A| 2| 0|
# | A| 3| 1|
# | B| 1| 0|
# | B| 2| 1|
# +----+-----+----+