How to apply an F.when condition separately to unique subsets of the data

Question

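I want to apply a condition to subsets of my data. In the example, I want to use F.when separately on "A" and "B" in col1, and get back a DataFrame that contains both "A" and "B" with the condition applied.

I tried doing this with a group by, but I am not interested in aggregating the data; I want the same number of rows before and after applying the condition.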

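import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local").appName("test").getOrCreate()

# Toy data: two groups in col1, each with its own scores
df = spark.createDataFrame(pd.DataFrame({"col1": ["A", "A", "A", "B", "B"], "score": [1, 2, 3, 1, 2]}))

# Flag rows whose score is greater than 2
condition = F.when(F.col("score") > 2, 1).otherwise(0)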

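Does anyone have suggestions on how to solve this? Below is my expected output, but it is crucial that the condition is applied to "A" and "B" separately, because my actual use case differs somewhat from the toy example provided.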

Answer 1

Try:

df.select(df.col1, df.score, condition.alias("send")).show()
# Out:
# +----+-----+----+
# |col1|score|send|
# +----+-----+----+
# |   A|    1|   0|
# |   A|    2|   0|
# |   A|    3|   1|
# |   B|    1|   0|
# |   B|    2|   0|
# +----+-----+----+

(See: pyspark.sql.Column.when)

To apply several conditions that depend on the row's values, use:

from pyspark.sql.functions import col, when
# Each when() pairs a group check on col1 with that group's own score threshold
df.withColumn("send", when((col("col1") == "A") & (col("score") > 2), 1)
                     .when((col("col1") == "B") & (col("score") > 1), 1)
                     .otherwise(0)
             ).show()
# Out:
# +----+-----+----+
# |col1|score|send|
# +----+-----+----+
# |   A|    1|   0|
# |   A|    2|   0|
# |   A|    3|   1|
# |   B|    1|   0|
# |   B|    2|   1|
# +----+-----+----+

(See: pyspark.sql.functions.when)

Solution 1
1 2022-01-12 16:09:27