
pyspark lag function with inconsistent time series

import pyspark.sql.functions as F
from pyspark.sql.window import Window

I would like to use a window function to find the value from a column 4 periods ago.

Suppose my data (df) looks like this (in reality I have many different IDs):

ID | value | period

a  |  100  |   1   
a  |  200  |   2   
a  |  300  |   3   
a  |  400  |   5   
a  |  500  |   6   
a  |  600  |   7   

If the time series were consistent (e.g. periods 1-6) I could just use F.lag(df['value'], count=4).over(Window.partitionBy('id').orderBy('period')).
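Spelled out, that straightforward version would look something like this (just a sketch, assuming each id really does have consecutive periods):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Plain 4-row lag; this only means "4 periods ago" if there are no gaps
# in period within each id.
window = Window.partitionBy('id').orderBy('period')
df = df.withColumn('4_lag_value', F.lag('value', 4).over(window))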

However, because the time series has a discontinuity, the values would be displaced.

My desired output would be this:

ID | value | period | 4_lag_value
a  |  100  |   1    |     nan
a  |  200  |   2    |     nan 
a  |  300  |   3    |     nan
a  |  400  |   5    |     100
a  |  500  |   6    |     200
a  |  600  |   7    |     300

How can I do this in pyspark?

This is probably what you're looking for:

from pyspark.sql import Window, functions as F

def pyspark_timed_lag_values(df, lags, avg_diff, state_id='state_id', ds='ds', y='y'):
    # For each state_id, build the full grid of dates from min(ds) to max(ds)
    # in steps of avg_diff days, so that every expected period is present.
    interval_expr = 'sequence(min_ds, max_ds, interval {0} day)'.format(avg_diff)
    all_comb = (df.groupBy(F.col(state_id))
                .agg(F.min(ds).alias('min_ds'), F.max(ds).alias('max_ds'))
                .withColumn(ds, F.explode(F.expr(interval_expr)))
                .select(*[state_id, ds]))

    # Left-join the real observations onto the grid; rows that only exist in
    # the grid get a null 'exists' flag and are dropped again at the end.
    all_comb = all_comb.join(df.withColumn('exists', F.lit(True)), on=[state_id, ds], how='left')

    # Lag over the dense grid, so each lag step corresponds to exactly one period.
    window = Window.partitionBy(state_id).orderBy(F.col(ds).asc())
    for lag in lags:
        all_comb = all_comb.withColumn("{0}_{1}".format(y, lag), F.lag(y, lag).over(window))

    all_comb = all_comb.filter(F.col('exists')).drop(*['exists'])
    return all_comb

    

Let's apply it to an example:

data = spark.sparkContext.parallelize([
        (1,"2021-01-03",100),
        (1,"2021-01-10",830),
        (1,"2021-01-17",300),
        (1,"2021-02-07",450),
        (2,"2021-01-03",500),
        (2,"2021-01-17",800),
        (2,"2021-02-14",800)])


example = spark.createDataFrame(data, ['state_id','ds','y'])
example = example.withColumn('ds', F.to_date(F.col('ds')))

n_periods = 7
lags = list(range(1, n_periods + 1))
result = pyspark_timed_lag_values(example, lags=lags, avg_diff=7)
result.show()

This results in the following output:

+--------+----------+---+----+----+----+----+----+----+----+
|state_id|        ds|  y| y_1| y_2| y_3| y_4| y_5| y_6| y_7|
+--------+----------+---+----+----+----+----+----+----+----+
|       1|2021-01-03|100|null|null|null|null|null|null|null|
|       1|2021-01-10|830| 100|null|null|null|null|null|null|
|       1|2021-01-17|300| 830| 100|null|null|null|null|null|
|       1|2021-02-07|450|null|null| 300| 830| 100|null|null|
|       2|2021-01-03|500|null|null|null|null|null|null|null|
|       2|2021-01-17|800|null| 500|null|null|null|null|null|
|       2|2021-02-14|800|null|null|null| 800|null| 500|null|
+--------+----------+---+----+----+----+----+----+----+----+

Right now it's prepared for dates, but with a small adaptation it should be applicable to various use cases. In this case the disadvantage is the necessary application of explode to create all possible date combinations, and the creation of the helper DataFrame all_comb.

The real benefit of this solution is that it's applicable to most use cases dealing with time series, as the parameter avg_diff defines the expected distance between time periods.
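For instance, adapting it to the question's integer period column just means swapping the date interval for an integer step of 1. A rough sketch along the same lines (assuming a SparkSession named spark, and the question's columns id, value and period):

from pyspark.sql import Window, functions as F

# Sketch: same grid-and-lag idea, but on integer periods instead of dates.
data = spark.sparkContext.parallelize([
        ('a', 100, 1), ('a', 200, 2), ('a', 300, 3),
        ('a', 400, 5), ('a', 500, 6), ('a', 600, 7)])
df = spark.createDataFrame(data, ['id', 'value', 'period'])

# All periods between min and max per id (sequence also works on integers).
all_periods = (df.groupBy('id')
               .agg(F.min('period').alias('min_p'), F.max('period').alias('max_p'))
               .withColumn('period', F.explode(F.expr('sequence(min_p, max_p, 1)')))
               .select('id', 'period'))

# Left-join the real rows onto the dense grid, lag by 4 rows, keep original rows.
window = Window.partitionBy('id').orderBy('period')
dense = all_periods.join(df.withColumn('exists', F.lit(True)), on=['id', 'period'], how='left')
result = (dense.withColumn('4_lag_value', F.lag('value', 4).over(window))
          .filter(F.col('exists'))
          .drop('exists'))

On the question's data this should reproduce the desired 4_lag_value column.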

Just to mention, there is probably a cleaner Hive SQL alternative to this.

I've come up with a solution, but it seems unnecessarily ugly; I would welcome anything better!

data = spark.sparkContext.parallelize([
        ('a',100,1),
        ('a',200,2),
        ('a',300,3),
        ('a',400,5),
        ('a',500,6),
        ('a',600,7)])

df = spark.createDataFrame(data, ['id','value','period'])

window = Window.partitionBy('id').orderBy('period')

# look 1, 2, 3 and 4 rows behind:
for diff in [1,2,3,4]:
    df = df.withColumn('{}_diff'.format(diff),
                       df['period'] - F.lag(df['period'], count=diff).over(window))

# if any of these are 4, that's the lag we need
# if not, there is no 4 period lagged return, so return None

#initialise col
df = df.withColumn('4_lag_value', F.lit(None))
# loop:
for diff in [1,2,3,4]:
    df = df.withColumn('4_lag_value',
                       F.when(df['{}_diff'.format(diff)] == 4,
                                 F.lag(df['value'], count=diff).over(window))
                              .otherwise(df['4_lag_value']))

# drop working cols
df = df.drop(*['{}_diff'.format(diff) for diff in [1,2,3,4]])

This returns the desired output.
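Incidentally, a shorter alternative might be a value-based window frame: with orderBy('period'), rangeBetween(-4, -4) frames exactly the row whose period is 4 less than the current one, so a plain aggregate can pick up its value. A sketch (assuming period is numeric and unique within each id, reusing df from above):

# Sketch: value-based frame instead of a row-based lag. The frame covers rows
# whose period lies exactly at (current period - 4); if no such row exists,
# max() over the empty frame returns null.
w4 = Window.partitionBy('id').orderBy('period').rangeBetween(-4, -4)
df = df.withColumn('4_lag_value', F.max('value').over(w4))

If a period could appear twice for the same id, max() would aggregate over both rows, so that case would need extra handling.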


 