Specifying the lag offset from the value of another column

Question

I have a table and want to add two more columns (column1 and column2) with the following expected values:

partition part_index value1 (column1) (column2)
1         1           1      1        null
1         2           1.5    1        null
1         3           3      1        null
1         4           5      1        null
1         5           6      1        null
2         1           5      5        6
2         2           2      5        6
2         3           3      5        6 
2         4           4      5        6
2         5           5      5        6
3         1           6      6        5
3         2           5.5    6        5
3         3           5      6        5
3         4           4.5    6        5 
3         5           4      6        5
4         1           6      6        4
4         2           10     6        4
4         3           2      6        4
4         4           3      6        4
4         5           4      6        4

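I tried to compute column1 with the lag function, taking the offset from the value of another column over the specified window, but I get an error: TypeError: Column is not iterable. Here is my code: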

from pyspark.sql import functions as f
from pyspark.sql import Window
window1=Window.partitionBy("partition").orderBy("part_index")
data.withColumn("column1", f.lag(f.col("column1"),\
                                  f.col("part_index")-1)\
                                  .over(window1)).show()

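How do I correctly specify the offset from the value of another column?

Second, I want column2 to hold the last value of value1 from the previous partition. I assume the solution is similar to the first one, but I don't know how to refer to the last value of a given column in the previous partition.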

Answer 1

If I understand correctly, column1 should be the first value of value1 within each partition, with rows ordered by part_index.
Here is my solution:

import pyspark.sql.functions as f
from pyspark.sql import Window

# Frame covers the whole partition, so first()/last() see every row of it.
window_spec = Window.partitionBy('partition').orderBy('part_index').rangeBetween(Window.unboundedPreceding, Window.unboundedFollowing)
# Frame covers all rows of strictly earlier partitions (requires a numeric `partition` column).
column2_window = Window.partitionBy().orderBy('partition').rangeBetween(Window.unboundedPreceding, -1)
data = (data
        .withColumn('column1', f.first(f.col('value1')).over(window_spec))
        .withColumn('last_value_of_partition', f.last('value1').over(window_spec))
        # The frame ends at the closest earlier partition, so last() returns that
        # partition's last value1; the frame is empty (null) for the first partition.
        .withColumn('column2', f.last('last_value_of_partition').over(column2_window))
        .select(data['*'], 'column1', 'column2')
)
data.show()

Input:

+---------+----------+------+
|partition|part_index|value1|
+---------+----------+------+
|        1|         1|     1|
|        1|         2|    11|
|        2|         1|     1|
|        2|         2|    12|
|        3|         1|     1|
|        3|         2|    13|
+---------+----------+------+

Output:

+---------+----------+------+-------+-------+
|partition|part_index|value1|column1|column2|
+---------+----------+------+-------+-------+
|        1|         1|     1|      1|   null|
|        1|         2|    11|      1|   null|
|        2|         1|     1|      1|     11|
|        2|         2|    12|      1|     11|
|        3|         1|     1|      1|     12|
|        3|         2|    13|      1|     12|
+---------+----------+------+-------+-------+
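
The reading of column1 used above (a lag of `part_index - 1` rows is the same as taking the first row of the partition) can be checked without Spark. The following is only a plain-Python sketch of that reasoning, not PySpark code: the function names `lag_by_row_offset` and `first_in_partition` are made up for illustration, the rows are the first two partitions of the question's table, and it assumes part_index runs 1, 2, 3, ... without gaps inside each partition.

```python
from itertools import groupby
from operator import itemgetter

# (partition, part_index, value1) rows: the first two partitions of the question's table.
rows = [
    (1, 1, 1), (1, 2, 1.5), (1, 3, 3), (1, 4, 5), (1, 5, 6),
    (2, 1, 5), (2, 2, 2), (2, 3, 3), (2, 4, 4), (2, 5, 5),
]


def partitions_in_order(rows):
    """Yield each partition's rows as a list, sorted by part_index."""
    ordered = sorted(rows, key=itemgetter(0, 1))
    for _, group in groupby(ordered, key=itemgetter(0)):
        yield list(group)


def lag_by_row_offset(rows):
    """Emulate a lag whose offset is `part_index - 1`, evaluated per row."""
    out = []
    for part in partitions_in_order(rows):
        for pos, (partition, part_index, value1) in enumerate(part):
            source = pos - (part_index - 1)  # position of the row we look back to
            lagged = part[source][2] if source >= 0 else None
            out.append((partition, part_index, value1, lagged))
    return out


def first_in_partition(rows):
    """Attach the first value1 of each partition to every row of that partition."""
    out = []
    for part in partitions_in_order(rows):
        first_value = part[0][2]
        out.extend(row + (first_value,) for row in part)
    return out


# With gap-free part_index values the two definitions agree row by row.
print(lag_by_row_offset(rows) == first_in_partition(rows))
```

If part_index had gaps or did not start at 1, the two definitions would diverge, so the "first value in the partition" formulation is the one that matches the expected column1 in the question.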

Answer 2

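After some brainstorming last night, I got the expected result by computing two helper DataFrames (df_help) and joining each of them to the main df. Note that index_time in the code below is the part_index from my original post. Here is my code: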

import pyspark.sql.functions as f
from pyspark.sql import Window

# `dane` and `df` are kept as posted; both are assumed to hold the input data before the joins.
# Window shared by both helper DataFrames
windowSpec_refPrice = Window.partitionBy("index")

# Column1: keep the row with the smallest index_time per index, then join it to the main df
df_help = dane.withColumn("min", f.min("index_time").over(windowSpec_refPrice))\
.where(f.col("index_time") == f.col("min")).select("index", "Value1")\
.withColumnRenamed("Value1", "Column1")
df = df.join(df_help, ["index"], how="inner")

# Column2: keep the row with the largest index_time per index
df_help = dane.withColumn("max", f.max("index_time").over(windowSpec_refPrice))\
.where(f.col("index_time") == f.col("max")).select("index", "Value1")\
.withColumnRenamed("Value1", "Column2")
# Shift by one row, ordered by index, so each index receives the previous index's last value
df_help = df_help.withColumn("Column2", f.lag("Column2", 1).over(Window.partitionBy(f.lit(1)).orderBy("index")))\
.select("index", "Column2")
df = df.join(df_help, ["index"], how="inner")
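
The helper-table idea for Column2 (take the last value per partition, shift it by one partition, join it back) can be traced on the question's data without Spark. This is a plain-Python sketch of the logic only, under the assumption that partitions are compared by their sort order; `add_previous_partition_last` is a hypothetical name, not part of any library.

```python
# (partition, part_index, value1) rows from the question's table.
rows = [
    (1, 1, 1), (1, 2, 1.5), (1, 3, 3), (1, 4, 5), (1, 5, 6),
    (2, 1, 5), (2, 2, 2), (2, 3, 3), (2, 4, 4), (2, 5, 5),
    (3, 1, 6), (3, 2, 5.5), (3, 3, 5), (3, 4, 4.5), (3, 5, 4),
    (4, 1, 6), (4, 2, 10), (4, 3, 2), (4, 4, 3), (4, 5, 4),
]


def add_previous_partition_last(rows):
    """Append to each row the last value1 of the preceding partition (None for the first)."""
    # Helper "table" 1: last value1 per partition, i.e. the row with the largest part_index.
    last_by_partition = {}
    for partition, _, value1 in sorted(rows, key=lambda r: (r[0], r[1])):
        last_by_partition[partition] = value1  # later rows overwrite earlier ones

    # Helper "table" 2: the same values shifted down by one partition,
    # which plays the role of a lag of 1 ordered by partition.
    partitions = sorted(last_by_partition)
    shifted = [None] + [last_by_partition[p] for p in partitions[:-1]]
    previous_last = dict(zip(partitions, shifted))

    # "Join" the helper back onto every row by partition.
    return [(p, i, v, previous_last[p]) for p, i, v in rows]


result = add_previous_partition_last(rows)
for row in result[::5]:  # one row per partition
    print(row)
```

Partitions 3 and 4 both end with the value 4, so any approach that identifies the previous partition by comparing values rather than partition keys would pick the wrong row here; shifting by partition key avoids that.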

Answer 1
0 2022-01-09 00:16:53

Answer 2
0 2022-01-09 11:42:11
