SPARK 3 - Populate value with value from previous rows (lookup)
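I am new to Spark. I have two DataFrames, events and players.
The events DataFrame consists of the columns
event_id| player_id| match_id| impact_score
The players DataFrame consists of the columns
player_id| player_name| nationality
I join the two datasets on player_id with this query: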
df_final = (events
    .orderBy("player_id")
    .join(players.orderBy("player_id"))
    .withColumn("current_team", when([no idea what goes in here]).otherwise(getCurrentTeam(col("player_id"))))
    .write.mode("overwrite")
    .partitionBy("current_team")
)
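The getCurrentTeam function triggers an HTTP call that returns a value (the player's current team).
I have data for more than 30 million football matches and 97 players. I need help creating the current_team column. Imagine that a given player appears 130,000 times in the events DataFrame. I need to look up the value from previous rows: if the player has already appeared, I just reuse that value (like an in-memory catalog); if not, I call the web service.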
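Because of its distributed nature, Spark cannot express "if the value was populated by a previous call, reuse it; otherwise make the call to create it". Rows are processed in parallel across partitions, so there is no global notion of a previous row. There are two possible options.
1. The players DataFrame lists every distinct player, so you can add the current_team column to it before applying the join. If players is cached before the join, the UDF may be invoked only once per player (without caching, a UDF can be called multiple times for each record).
2. Memoize getCurrentTeam.
Working example for the first option, with current_team pre-populated on players: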
from pyspark.sql import functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
# Assumes an active SparkSession named `spark` (as in the pyspark shell).
events_data = [(1, 1, 1, 10), (1, 2, 1, 20), (1, 3, 1, 30), (2, 3, 1, 30), (2, 1, 1, 10), (2, 2, 1, 20)]
players_data = [(1, "Player1", "Nat"), (2, "Player2", "Nat"), (3, "Player3", "Nat")]
events = spark.createDataFrame(events_data, ("event_id", "player_id", "match_id", "impact_score")).repartition(3)
players = spark.createDataFrame(players_data, ("player_id", "player_name", "nationality")).repartition(3)
@udf(StringType())
def getCurrentTeam(player_id):
    # Stand-in for the HTTP call that returns the player's current team
    return f"player_{player_id}_team"
players_with_current_team = players.withColumn("current_team", getCurrentTeam(F.col("player_id"))).cache()
events.join(players_with_current_team, ["player_id"]).show()
+---------+--------+--------+------------+-----------+-----------+-------------+
|player_id|event_id|match_id|impact_score|player_name|nationality| current_team|
+---------+--------+--------+------------+-----------+-----------+-------------+
|        2|       2|       1|          20|    Player2|        Nat|player_2_team|
|        2|       1|       1|          20|    Player2|        Nat|player_2_team|
|        3|       2|       1|          30|    Player3|        Nat|player_3_team|
|        3|       1|       1|          30|    Player3|        Nat|player_3_team|
|        1|       2|       1|          10|    Player1|        Nat|player_1_team|
|        1|       1|       1|          10|    Player1|        Nat|player_1_team|
+---------+--------+--------+------------+-----------+-----------+-------------+
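With only 97 distinct players, another way to get one call per player might be to resolve the teams in plain Python on the driver and join the result back as a small lookup table. The sketch below shows only that lookup-building step, without Spark. `fetch_current_team` is a hypothetical stand-in for the HTTP call, and `build_team_lookup` and `counting_fetch` are names introduced here purely for illustration.

```python
from typing import Callable, Dict, Iterable, List


def fetch_current_team(player_id: int) -> str:
    """Hypothetical stand-in for the HTTP call made by getCurrentTeam."""
    return f"player_{player_id}_team"


def build_team_lookup(
    player_ids: Iterable[int],
    fetch: Callable[[int], str],
) -> Dict[int, str]:
    """Call `fetch` exactly once per distinct player_id."""
    lookup: Dict[int, str] = {}
    for player_id in player_ids:
        if player_id not in lookup:
            lookup[player_id] = fetch(player_id)
    return lookup


# Same toy events as above: (event_id, player_id, match_id, impact_score)
events_rows = [(1, 1, 1, 10), (1, 2, 1, 20), (1, 3, 1, 30),
               (2, 3, 1, 30), (2, 1, 1, 10), (2, 2, 1, 20)]

calls: List[int] = []  # records every simulated service call


def counting_fetch(player_id: int) -> str:
    """Wrap the stand-in fetch so the number of calls can be inspected."""
    calls.append(player_id)
    return fetch_current_team(player_id)


team_by_player = build_team_lookup((row[1] for row in events_rows), counting_fetch)

# Attach current_team to every event with a dictionary lookup; no further calls.
events_with_team = [row + (team_by_player[row[1]],) for row in events_rows]
```

In Spark, the pairs in `team_by_player` could be turned into a small DataFrame with `spark.createDataFrame` and joined to events on player_id, which would keep the HTTP calls off the executors. Whether that fits the asker's setup is an assumption; the answer itself only describes the two options listed earlier.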
For the memoization option, I use a Python dict to simulate the cache and an accumulator to count the number of simulated network calls.
from pyspark.sql import functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
import time
# Assumes an active SparkSession named `spark` (as in the pyspark shell).
events_data = [(1, 1, 1, 10), (1, 2, 1, 20), (1, 3, 1, 30), (2, 3, 1, 30), (2, 1, 1, 10), (2, 2, 1, 20)]
players_data = [(1, "Player1", "Nat"), (2, "Player2", "Nat"), (3, "Player3", "Nat")]
events = spark.createDataFrame(events_data, ("event_id", "player_id", "match_id", "impact_score")).repartition(3)
players = spark.createDataFrame(players_data, ("player_id", "player_name", "nationality")).repartition(3)
players_events_joined = events.join(players, ["player_id"])
memoized_call_counter = spark.sparkContext.accumulator(0)
def memoize_call():
    # One cache per Python worker process; it is not shared across workers.
    cache = {}

    def getCurrentTeam(player_id):
        global memoized_call_counter
        cached_value = cache.get(player_id, None)
        if cached_value is not None:
            return cached_value
        # Sleep to mimic a network call
        time.sleep(1)
        # Increment the counter every time the cached value can't be looked up
        memoized_call_counter.add(1)
        cache[player_id] = f"player_{player_id}_team"
        return cache[player_id]

    return getCurrentTeam
getCurrentTeam_udf = udf(memoize_call(), StringType())
players_events_joined.withColumn("current_team", getCurrentTeam_udf(F.col("player_id"))).show()
+---------+--------+--------+------------+-----------+-----------+-------------+
|player_id|event_id|match_id|impact_score|player_name|nationality| current_team|
+---------+--------+--------+------------+-----------+-----------+-------------+
|        2|       2|       1|          20|    Player2|        Nat|player_2_team|
|        2|       1|       1|          20|    Player2|        Nat|player_2_team|
|        3|       2|       1|          30|    Player3|        Nat|player_3_team|
|        3|       1|       1|          30|    Player3|        Nat|player_3_team|
|        1|       2|       1|          10|    Player1|        Nat|player_1_team|
|        1|       1|       1|          10|    Player1|        Nat|player_1_team|
+---------+--------+--------+------------+-----------+-----------+-------------+
>>> memoized_call_counter.value
3
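Since there are 3 unique players in total, the logic after time.sleep(1) was invoked only three times. The number of calls depends on the number of workers, because the cache is not shared between workers: in the worst case every worker makes one call per player it encounters. I ran the example in local mode (with 1 worker), so each player triggered exactly one call.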
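The dependence on the number of workers can be illustrated without Spark. The sketch below simulates each worker as its own memoized copy of the lookup, using the standard-library `functools.lru_cache` in place of the hand-written dict. `make_worker_lookup`, `count_calls`, and the round-robin assignment of rows to workers are assumptions made for the illustration; they are not how Spark actually schedules tasks.

```python
from functools import lru_cache
from typing import Callable, List


def make_worker_lookup(call_log: List[int]) -> Callable[[int], str]:
    """Return a memoized lookup with its own private cache, like one worker process."""

    @lru_cache(maxsize=None)
    def get_current_team(player_id: int) -> str:
        call_log.append(player_id)  # reached only on a cache miss
        return f"player_{player_id}_team"  # stand-in for the HTTP response

    return get_current_team


def count_calls(player_ids: List[int], num_workers: int) -> int:
    """Spread rows over workers round-robin and count simulated service calls."""
    call_log: List[int] = []
    workers = [make_worker_lookup(call_log) for _ in range(num_workers)]
    for i, player_id in enumerate(player_ids):
        workers[i % num_workers](player_id)
    return len(call_log)


player_ids = [1, 2, 3, 3, 1, 2] * 100  # 600 rows, 3 distinct players

one_worker_calls = count_calls(player_ids, num_workers=1)
four_worker_calls = count_calls(player_ids, num_workers=4)
```

With a single worker the count equals the number of distinct players. With several workers it grows, up to a ceiling of distinct players times workers, because every private cache has to be warmed separately.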