pyspark lag function with inconsistent time series
import pyspark.sql.functions as F
from pyspark.sql.window import Window
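I want to use a window function to look up the value of a column from 4 periods earlier.
Suppose my data (df) looks like this (in reality I have many different IDs):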
ID | value | period
a | 100 | 1
a | 200 | 2
a | 300 | 3
a | 400 | 5
a | 500 | 6
a | 600 | 7
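If the time series were consistent (e.g. periods 1-6), I could simply use F.lag(df['value'], 4).over(Window.partitionBy('id').orderBy('period'))
However, because the time series has gaps, lag counts rows rather than periods, so the values end up shifted onto the wrong rows.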
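To make the shift concrete, here is a minimal plain-Python sketch (no Spark involved) that compares a row-based lag with a lookup by period on the sample data. The helper names row_lag and period_lag are made up for this illustration.

```python
# Plain-Python illustration (not Spark): row-based lag vs. period-based lookup.
periods = [1, 2, 3, 5, 6, 7]
values = [100, 200, 300, 400, 500, 600]


def row_lag(vals, offset):
    """Value `offset` rows earlier, or None when there is no such row."""
    return [vals[i - offset] if i >= offset else None for i in range(len(vals))]


def period_lag(pers, vals, offset):
    """Value whose period equals the current period minus `offset`, or None."""
    by_period = dict(zip(pers, vals))
    return [by_period.get(p - offset) for p in pers]


row_based = row_lag(values, 4)
period_based = period_lag(periods, values, 4)

print(row_based)     # [None, None, None, None, 100, 200]
print(period_based)  # [None, None, None, 100, 200, 300]
```

The row-based version gives period 5 nothing and period 6 the value from period 1, because the missing period 4 is not counted as a row.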
The output I want looks like this:
ID | value | period | 4_lag_value
a | 100 | 1 | nan
a | 200 | 2 | nan
a | 300 | 3 | nan
a | 400 | 5 | 100
a | 500 | 6 | 200
a | 600 | 7 | 300
How can I do this in pyspark?
This may be what you are looking for:
from pyspark.sql import Window, functions as F

def pyspark_timed_lag_values(df, lags, avg_diff, state_id='state_id', ds='ds', y='y'):
    # Build every expected date per state_id, spaced avg_diff days apart.
    interval_expr = 'sequence(min_ds, max_ds, interval {0} day)'.format(avg_diff)
    all_comb = (df.groupBy(F.col(state_id))
                  .agg(F.min(ds).alias('min_ds'), F.max(ds).alias('max_ds'))
                  .withColumn(ds, F.explode(F.expr(interval_expr)))
                  .select(*[state_id, ds]))
    # Left-join the real rows onto the full grid; 'exists' marks rows present in df.
    all_comb = all_comb.join(df.withColumn('exists', F.lit(True)), on=[state_id, ds], how='left')
    window = Window.partitionBy(state_id).orderBy(F.col(ds).asc())
    # On the gap-free grid, a row-based lag is the same as a lag in periods.
    for lag in lags:
        all_comb = all_comb.withColumn("{0}_{1}".format(y, lag), F.lag(y, lag).over(window))
    # Drop the filler rows that were added only to close the gaps.
    all_comb = all_comb.filter(F.col('exists')).drop(*['exists'])
    return all_comb
Let's apply it to an example:
data = spark.sparkContext.parallelize([
(1,"2021-01-03",100),
(1,"2021-01-10",830),
(1,"2021-01-17",300),
(1,"2021-02-07",450),
(2,"2021-01-03",500),
(2,"2021-01-17",800),
(2,"2021-02-14",800)])
example = spark.createDataFrame(data, ['state_id','ds','y'])
example = example.withColumn('ds', F.to_date(F.col('ds')))
n_periods = 7  # the result below shows lag columns y_1 through y_7
lags = list(range(1, n_periods + 1))
result = pyspark_timed_lag_values(example, lags = lags, avg_diff = 7)
This gives the following result:
+--------+----------+---+----+----+----+----+----+----+----+
|state_id| ds| y| y_1| y_2| y_3| y_4| y_5| y_6| y_7|
+--------+----------+---+----+----+----+----+----+----+----+
| 1|2021-01-03|100|null|null|null|null|null|null|null|
| 1|2021-01-10|830| 100|null|null|null|null|null|null|
| 1|2021-01-17|300| 830| 100|null|null|null|null|null|
| 1|2021-02-07|450|null|null| 300| 830| 100|null|null|
| 2|2021-01-03|500|null|null|null|null|null|null|null|
| 2|2021-01-17|800|null| 500|null|null|null|null|null|
| 2|2021-02-14|800|null|null|null| 800|null| 500|null|
+--------+----------+---+----+----+----+----+----+----+----+
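As written it is set up for dates, but with a few small adjustments it should work for other use cases. The downside here is that it needs explode to create all possible date combinations and build the helper DataFrame all_comb.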
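The real benefit of this solution is that it works for most time-series use cases, because the parameter avg_diff defines the expected distance between periods.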
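The idea behind the function (build a gap-free grid, lag by rows on that grid, then drop the filler rows) can be sketched in plain Python. This is only an illustration of the logic, not Spark code; gap_filled_lag and its step parameter are hypothetical names, with step standing in for avg_diff, and it assumes integer periods that all fall on the grid.

```python
def gap_filled_lag(periods, values, offset, step=1):
    """Lag by `offset` steps on a dense grid, then keep only the original periods.

    Assumes integer periods that all lie on the grid min, min + step, ...
    """
    by_period = dict(zip(periods, values))
    # Dense grid from the first to the last period (the sequence + explode step).
    grid = list(range(min(periods), max(periods) + 1, step))
    # Missing periods get None (the left join step).
    dense = [by_period.get(p) for p in grid]
    # Row-based lag on the dense grid (the F.lag step).
    lagged = [dense[i - offset] if i >= offset else None for i in range(len(grid))]
    # Keep only periods that were in the input (the filter on 'exists').
    existing = set(periods)
    return [(p, v, lag) for p, v, lag in zip(grid, dense, lagged) if p in existing]


rows = gap_filled_lag([1, 2, 3, 5, 6, 7], [100, 200, 300, 400, 500, 600], offset=4)
for row in rows:
    print(row)
```

Applied to the data from the question, the rows for periods 5, 6 and 7 pick up 100, 200 and 300, while period 4 exists only as a filler row and is dropped at the end.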
Just to mention, there may be a cleaner Hive SQL alternative.
I came up with a solution, but it seems unnecessarily ugly, so anything better is welcome!
data = spark.sparkContext.parallelize([
('a',100,1),
('a',200,2),
('a',300,3),
('a',400,5),
('a',500,6),
('a',600,7)])
df = spark.createDataFrame(data, ['id','value','period'])
window = Window.partitionBy('id').orderBy('period')
# look 1, 2, 3 and 4 rows behind:
# (the lag offset is passed positionally; its keyword name differs between Spark versions)
for diff in [1,2,3,4]:
    df = df.withColumn('{}_diff'.format(diff),
                       df['period'] - F.lag(df['period'], diff).over(window))
# if any of these are 4, that's the lag we need
# if not, there is no 4 period lagged return, so return None
# initialise col
df = df.withColumn('4_lag_value', F.lit(None))
# loop:
for diff in [1,2,3,4]:
    df = df.withColumn('4_lag_value',
                       F.when(df['{}_diff'.format(diff)] == 4,
                              F.lag(df['value'], diff).over(window))
                        .otherwise(df['4_lag_value']))
# drop working cols
df = df.drop(*['{}_diff'.format(diff) for diff in [1,2,3,4]])
This returns the desired output.