SPARK 3 - Populate a value using values from previous rows (lookup)

Question

I'm new to Spark. I have 2 dataframes, events and players.

The events dataframe consists of the columns

event_id| player_id| match_id| impact_score

The players dataframe consists of the columns

player_id| player_name| nationality

I join the two datasets on player_id with this query:

df_final = (events
  .orderBy("player_id") 
  .join(players.orderBy("player_id"))
  .withColumn("current_team", when([no idea what goes in here]).otherwise(getCurrentTeam(col("player_id"))))
  .write.mode("overwrite")
  .partitionBy("current_team")
)

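The getCurrentTeam function triggers an HTTP call that returns a value (the player's current team).

I have data for more than 30 million football matches and 97 players. I need help creating the current_team column. Imagine that a certain player appears 130,000 times in the events dataframe. I need to look up the value from previous rows. If the player has already appeared, I just take that value (like an in-memory catalog). If the player has not appeared yet, I call the web service.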

Answer 1

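Because of its distributed nature, Spark cannot support logic of the form "use the value if an earlier call already populated it, otherwise make the call to create it" across rows. There are two possible options.

1. Since you are applying an inner join and the players df has the list of all distinct players, you can add the current_team column to this df before applying the join. If the players df is cached before the join, the UDF is likely to be invoked only once per player. See the linked discussion for why a UDF can be called multiple times for each record.
2. You can memoize getCurrentTeam.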

Working example - pre-populating `current_team`

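from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Reuse the active session (e.g. in the pyspark shell) or create one
spark = SparkSession.builder.getOrCreate()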

events_data = [(1, 1, 1, 10), (1, 2, 1, 20, ), (1, 3, 1, 30, ), (2, 3, 1, 30, ), (2, 1, 1, 10), (2, 2, 1, 20, ), ]
players_data = [(1, "Player1", "Nat", ), (2, "Player2", "Nat", ), (3, "Player3", "Nat", ), ]

events = spark.createDataFrame(events_data, ("event_id", "player_id", "match_id", "impact_score", ), ).repartition(3)
players = spark.createDataFrame(players_data, ("player_id", "player_name", "nationality", ), ).repartition(3)


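# Stand-in for the real lookup; in the question this would be the HTTP call
@udf(StringType())
def getCurrentTeam(player_id):
    return f"player_{player_id}_team"

# Compute current_team on the small players dataframe and cache the result,
# so the UDF is likely evaluated only once per player
players_with_current_team = players.withColumn("current_team", getCurrentTeam(F.col("player_id"))).cache()

# Inner join on player_id attaches current_team to every event
events.join(players_with_current_team, ["player_id"]).show()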

Output

+---------+--------+--------+------------+-----------+-----------+-------------+
|player_id|event_id|match_id|impact_score|player_name|nationality| current_team|
+---------+--------+--------+------------+-----------+-----------+-------------+
|        2|       2|       1|          20|    Player2|        Nat|player_2_team|
|        2|       1|       1|          20|    Player2|        Nat|player_2_team|
|        3|       2|       1|          30|    Player3|        Nat|player_3_team|
|        3|       1|       1|          30|    Player3|        Nat|player_3_team|
|        1|       2|       1|          10|    Player1|        Nat|player_1_team|
|        1|       1|       1|          10|    Player1|        Nat|player_1_team|
+---------+--------+--------+------------+-----------+-----------+-------------+
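
With only 97 distinct players, another way to get one call per player is to do the lookups on the driver in plain Python and join the result back as a small dataframe. This is a different technique from the cached UDF above, shown here only as a minimal sketch. `build_team_lookup` and `fake_fetch_team` are hypothetical names, and `fake_fetch_team` stands in for the real HTTP call.

```python
def build_team_lookup(player_ids, fetch_team):
    """Call fetch_team exactly once per distinct player_id.

    Returns a dict mapping player_id -> current team.
    """
    return {pid: fetch_team(pid) for pid in sorted(set(player_ids))}


# Hypothetical stand-in for the web service; records every call it receives
calls = []

def fake_fetch_team(player_id):
    calls.append(player_id)
    return f"player_{player_id}_team"


# player_id values as they might appear in the events data (with repeats)
player_ids = [1, 2, 3, 3, 1, 2]

lookup = build_team_lookup(player_ids, fake_fetch_team)

# (player_id, current_team) tuples, ready to become a small lookup dataframe
lookup_rows = sorted(lookup.items())
```

In Spark, the ids could come from `players.select("player_id").distinct().collect()`, and `lookup_rows` could be passed to `spark.createDataFrame(lookup_rows, ["player_id", "current_team"])` and joined to events on player_id. Because the calls run in ordinary Python on the driver, each player is looked up once regardless of how Spark schedules the job.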

Working example - memoization

I use a Python dict to mimic the cache and an accumulator to count the number of mimicked network calls.

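from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
import time

# Reuse the active session (e.g. in the pyspark shell) or create one
spark = SparkSession.builder.getOrCreate()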

events_data = [(1, 1, 1, 10), (1, 2, 1, 20, ), (1, 3, 1, 30, ), (2, 3, 1, 30, ), (2, 1, 1, 10), (2, 2, 1, 20, ), ]
players_data = [(1, "Player1", "Nat", ), (2, "Player2", "Nat", ), (3, "Player3", "Nat", ), ]

events = spark.createDataFrame(events_data, ("event_id", "player_id", "match_id", "impact_score", ), ).repartition(3)
players = spark.createDataFrame(players_data, ("player_id", "player_name", "nationality", ), ).repartition(3)

players_events_joined = events.join(players, ["player_id"])

# Accumulator that counts how many times the mimicked network call runs
memoized_call_counter = spark.sparkContext.accumulator(0)

def memoize_call():
    # Dict captured by the closure; it is not shared across workers
    cache = {}
    def getCurrentTeam(player_id):
        cached_value = cache.get(player_id)
        if cached_value is not None:
            return cached_value
        # Sleep to mimic the network call
        time.sleep(1)
        # Increment the counter every time a value can't be found in the cache
        memoized_call_counter.add(1)
        cache[player_id] = f"player_{player_id}_team"
        return cache[player_id]
    return getCurrentTeam

getCurrentTeam_udf = udf(memoize_call(), StringType())

players_events_joined.withColumn("current_team", getCurrentTeam_udf(F.col("player_id"))).show()

Output

+---------+--------+--------+------------+-----------+-----------+-------------+
|player_id|event_id|match_id|impact_score|player_name|nationality| current_team|
+---------+--------+--------+------------+-----------+-----------+-------------+
|        2|       2|       1|          20|    Player2|        Nat|player_2_team|
|        2|       1|       1|          20|    Player2|        Nat|player_2_team|
|        3|       2|       1|          30|    Player3|        Nat|player_3_team|
|        3|       1|       1|          30|    Player3|        Nat|player_3_team|
|        1|       2|       1|          10|    Player1|        Nat|player_1_team|
|        1|       1|       1|          10|    Player1|        Nat|player_1_team|
+---------+--------+--------+------------+-----------+-----------+-------------+

>>> memoized_call_counter.value
3

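Since there are 3 unique players in total, the logic after time.sleep(1) was invoked only three times. The number of calls depends on the number of workers, because the cache is not shared across workers. I ran the example in local mode (with 1 worker), so each unique player triggered exactly one call.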
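
The dependence on the number of workers can be illustrated without Spark. The sketch below is a plain-Python simulation, not Spark's actual scheduling: each simulated worker gets its own private cache (here via `functools.lru_cache`), and rows are dealt round-robin so that the same player lands on several workers, which is the worst case. `make_worker_lookup` and `simulate` are hypothetical names.

```python
from functools import lru_cache


def make_worker_lookup(call_log):
    """Return a memoized lookup with a private cache, like one worker's copy."""
    @lru_cache(maxsize=None)
    def get_current_team(player_id):
        call_log.append(player_id)  # stands in for the HTTP call
        return f"player_{player_id}_team"
    return get_current_team


def simulate(player_ids, num_workers):
    """Deal rows round-robin to workers; return (results, number of calls)."""
    call_log = []
    workers = [make_worker_lookup(call_log) for _ in range(num_workers)]
    results = [workers[i % num_workers](pid) for i, pid in enumerate(player_ids)]
    return results, len(call_log)


rows = [1, 1, 2, 2, 3, 3]  # player_id of each event row

_, calls_one_worker = simulate(rows, 1)
_, calls_two_workers = simulate(rows, 2)
```

With one worker the simulated service is called once per unique player. With two workers every player ends up on both, so the call count doubles. In a real job the count would fall somewhere between these bounds, depending on how rows for the same player are distributed.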

Solution 1
0 2021-12-12 10:15:16
