Databricks / Spark equivalent to lookup (done via CROSS APPLY in SQL)
My users have a small "scale factor" configuration that they want applied to a fixed-size table (50,000 rows).
This is how it is currently configured and turned into a small dataframe:
from pyspark.sql import Row

rank_lbounds = [ 0, 101, 175, 250, 500, 50000]
scale_factors = [0.64, 0.6, 0.8, 0.99, 1.0, 1.0]

ScalingFactor = Row("rank_min", "scale_factor")
df_scaling_factors = spark.createDataFrame(
    [ScalingFactor(rank, scale) for (rank, scale) in zip(rank_lbounds, scale_factors)])
The idea is that a result set (50,000 rows) will be sorted from largest to smallest, and then the top 100 values will be scaled down by a factor of 0.64, the next 75 by 0.6, and so on...
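In plain terms, the lookup described above is "find the scale factor whose rank_min is the largest one not exceeding a row's rank". As a minimal sketch of that semantics with the standard-library bisect module (the function name rank_to_factor is just for illustration, not part of the question's code):

```python
import bisect

# Lower bounds and factors from the question's configuration.
rank_lbounds = [0, 101, 175, 250, 500, 50000]
scale_factors = [0.64, 0.6, 0.8, 0.99, 1.0, 1.0]

def rank_to_factor(rank):
    """Return the factor whose rank_min is the largest value <= rank."""
    i = bisect.bisect_right(rank_lbounds, rank) - 1
    return scale_factors[i]
```

So ranks 1-100 map to 0.64, ranks 101-174 to 0.6, and so on, which is exactly what the CROSS APPLY below computes per row.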
In SQL, the "preferred" way to do this kind of thing efficiently is apparently CROSS APPLY. Here is their solution:
SELECT SomeKey, SomeValue, SomeValue_Rank, ScaleFactor,
SomeValue_Scaled = (SomeValue * ScaleFactor)
FROM (
SELECT SomeKey, SomeValue, SomeValue_Rank,
T_FactorLookup.rank_min AS NextLowestRankLookup,
T_FactorLookup.Rank_ScaleFactor AS ScaleFactor
FROM (
SELECT SomeKey, SomeValue,
SomeValue_Rank = row_number() over(order by SomeValue desc, SomeKey)
FROM dbo.TableOfValuesToScale
) AS T_Ranked
CROSS APPLY(
SELECT TOP 1 rank_min, Rank_ScaleFactor
FROM Extortion_VoR_Scalefactors AS Factors
WHERE Factors.rank_min <= SomeValue_Rank
ORDER BY Factors.rank_min DESC
) T_FactorLookup
) T_WithScaleFactors
In trying to port this to Databricks, I'm not sure what the best way to do this kind of lookup is. I know the lookup table will always be small (sparse), so procedurally I wouldn't be uncomfortable implementing it as a doubly-nested for loop or as a cartesian join with a filter, but I'd like to use best practices in case the example gets applied to a larger dataset.
Solutions I have considered:

- A direct equivalent of CROSS APPLY
- "Exploding" the df_scaling_factors table into a 50,000-row table and doing a simple join on row_number() over ... = rank_min

I would go with joining on the row number. Perhaps you can add one more column to the scale factor table to make the join easier:
from pyspark.sql import Row, Window, functions as F

rank_lbounds = [ 0, 101, 175, 250, 500]
rank_ubounds = [ 100, 174, 249, 499, 50000]
scale_factors = [0.64, 0.6, 0.8, 0.99, 1.0]

ScalingFactor = Row("rank_min", "rank_max", "scale_factor")
df_scaling_factors = spark.createDataFrame(
    [ScalingFactor(rankl, ranku, scale)
     for (rankl, ranku, scale) in zip(rank_lbounds, rank_ubounds, scale_factors)])

# Rank from largest to smallest, matching the original SQL's
# row_number() over (order by SomeValue desc).
df2 = df.withColumn('rn', F.row_number().over(Window.orderBy(F.desc('value'))))

# Range join: each row matches the single bucket whose
# [rank_min, rank_max] interval contains its rank.
joined = df2.join(
    df_scaling_factors,
    (df2.rn >= df_scaling_factors.rank_min) & (df2.rn <= df_scaling_factors.rank_max)
)

joined2 = joined.withColumn('scaled_values', F.col('scale_factor') * F.col('value'))
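As a quick sanity check that the rank_min/rank_max buckets cover every rank exactly once (plain Python, no Spark session needed), the inclusive join condition above can be replayed as a loop; the function name factor_via_range_join is just for illustration:

```python
# The bucket table from the answer above.
rank_lbounds = [0, 101, 175, 250, 500]
rank_ubounds = [100, 174, 249, 499, 50000]
scale_factors = [0.64, 0.6, 0.8, 0.99, 1.0]

def factor_via_range_join(rn):
    """Mimic the join condition rank_min <= rn <= rank_max."""
    for lo, hi, f in zip(rank_lbounds, rank_ubounds, scale_factors):
        if lo <= rn <= hi:
            return f
    return None  # rn falls outside every bucket

# Every rank from 1 to 50,000 should land in exactly one bucket,
# so the join never drops or duplicates rows.
assert all(factor_via_range_join(rn) is not None
           for rn in (1, 100, 101, 174, 175, 499, 500, 50000))
```

Because the intervals are disjoint and contiguous, the range join produces exactly one match per input row, just like the TOP 1 lookup in the original CROSS APPLY.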