How to apply F.when conditions separately to unique subsets of the data

Question

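I want to apply a condition to subsets of my data. In the example, I want to use F.when separately on the "A" rows and the "B" rows of col1, and get back a single DataFrame that contains both "A" and "B" with the condition applied.

I tried doing this with a group by, but I am not interested in aggregating the data. I want the same number of rows before and after the condition is applied.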

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local").appName("test").getOrCreate()

# Toy data: two groups in col1, each with its own scores
df = spark.createDataFrame(pd.DataFrame({"col1": ["A", "A", "A", "B", "B"], "score": [1, 2, 3, 1, 2]}))

# Flag rows whose score is above 2
condition = F.when(F.col("score") > 2, 1).otherwise(0)

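Does anyone have suggestions on how to solve this? My expected output is below, but the crucial point is that the condition is applied to "A" and "B" separately, because my real use case differs somewhat from the toy example provided.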

Answer 1

Try:

df.select(df.col1, df.score, condition.alias("send")).show()
# Out:
# +----+-----+----+
# |col1|score|send|
# +----+-----+----+
# |   A|    1|   0|
# |   A|    2|   0|
# |   A|    3|   1|
# |   B|    1|   0|
# |   B|    2|   0|
# +----+-----+----+

(See: pyspark.sql.Column.when)

To apply several conditions that depend on the row's values, use:

from pyspark.sql.functions import when

# One branch per group; rows matching neither branch get 0
df.withColumn(
    "send",
    when((df.col1 == "A") & (df.score > 2), 1)
    .when((df.col1 == "B") & (df.score > 1), 1)
    .otherwise(0),
).show()
# Out:
# +----+-----+----+
# |col1|score|send|
# +----+-----+----+
# |   A|    1|   0|
# |   A|    2|   0|
# |   A|    3|   1|
# |   B|    1|   0|
# |   B|    2|   1|
# +----+-----+----+

(See: pyspark.sql.functions.when)

Solution 1
1 2022-01-12 16:09:27
