PySpark: for each row, count rows in another table based on a condition

Question

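For each row in table 1, I am trying to count the rows in table 2 that satisfy a condition based on the value in table 1.

The Age in table 1 should be between StartAge and EndAge of table 2, or equal to StartAge or EndAge (an inclusive range).

Is this possible with a udf and withColumn? I have tried a couple of ways to do this, such as using withColumn on its own and withColumn with a UDF, but both approaches fail.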

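    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType

    def counter(a):
        # Count the rows of table2 whose [StartAge, EndAge] range contains a
        return table2.where((table2.StartAge <= a) & (table2.EndAge >= a)).count()

    # Attempted UDF; this is the version that fails
    counter_udf = udf(lambda age: counter(age), IntegerType())

    table1 = table1.withColumn('Count', counter_udf('Age ID'))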

Does this make sense? Thanks.

Sample input and output:

Answer 1

Take a look at this. You can do it with Spark SQL.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder \
        .appName('SO')\
        .getOrCreate()

    sc= spark.sparkContext

    df = sc.parallelize([([3]), ([4]), ([5])]).toDF(["age"])

    df1 = spark.createDataFrame([(0, 10), (7, 15), (5, 10), (3, 20), (5, 35), (4, 5),]
                           , ['age_start', 'age_end'])

    df.createTempView("table1")

    df1.createTempView("table2")



    spark.sql('''
        select t1.age as age_id, count(*) as count
        from table1 t1
        join table2 t2
          on t1.age >= t2.age_start and t1.age <= t2.age_end
        group by t1.age
        order by count
    ''').show()

    # +------+-----+
    # |age_id|count|
    # +------+-----+
    # |     3|    2|
    # |     4|    3|
    # |     5|    5|
    # +------+-----+

Answer 2

If you want to use a UDF in your script, you have to register it with Spark first.

Using this line of code should help fix your error:

    _ = spark.udf.register("counter_udf", counter_udf)

Answer 1
1, accepted, 2020-07-27 19:11:46

Answer 2
-1, 2020-07-27 18:40:25