Creating a lookup column in pyspark

Question

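I'm trying to create a new column in a pyspark dataframe that "looks up" the next event value in the same dataframe and copies it onto every row until that event occurs.

I tried the window function shown below, but still had no luck getting the next value into the column: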

from pyspark.sql import Window
from pyspark.sql.functions import col, lead, lit, when

condition = (col("col2") == 'event_start_ind')
w = Window.partitionBy("col2").orderBy(*[when(condition, lit(1)).desc()])

df.select(["timestamp",
           "col1",
           "col2",
           "col3"
          ]).withColumn("col4", when(condition, lead("col3",1).over(w))) \
.orderBy("timestamp") \
.show(500, truncate=False)

Clearly it does not look up the "next" event correctly. Any ideas on possible approaches?

A sample dataframe would be:

timestamp	col1	col2	col3
2021-02-02 01:03:55	s1	null	null
2021-02-02 01:04:16.952854	s1	other_ind	null
2021-02-02 01:04:32.398155	s1	null	null
2021-02-02 01:04:53.793089	s1	event_start_ind	event_1_value
2021-02-02 01:05:10.936913	s1	null	null
2021-02-02 01:05:36	s1	other_ind	null
2021-02-02 01:05:42	s1	null	null
2021-02-02 01:05:43	s1	null	null
2021-02-02 01:05:44	s1	event_start_ind	event_2_value
2021-02-02 01:05:46.623198	s1	null	null
2021-02-02 01:06:50	s1	null	null
2021-02-02 01:07:19.607685	s1	null	null

The desired result is:

timestamp	col1	col2	col3	col4
2021-02-02 01:03:55	s1	null	null	event_1_value
2021-02-02 01:04:16.952854	s1	other_ind	null	event_1_value
2021-02-02 01:04:32.398155	s1	null	null	event_1_value
2021-02-02 01:04:53.793089	s1	event_start_ind	event_1_value	event_1_value
2021-02-02 01:05:10.936913	s1	null	null	event_2_value
2021-02-02 01:05:36	s1	other_ind	null	event_2_value
2021-02-02 01:05:42	s1	null	null	event_2_value
2021-02-02 01:05:43	s1	null	null	event_2_value
2021-02-02 01:05:44	s1	event_start_ind	event_2_value	event_2_value
2021-02-02 01:05:46.623198	s1	null	null	null
2021-02-02 01:06:50	s1	null	null	null
2021-02-02 01:07:19.607685	s1	null	null	null

Answer 1

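It looks like your window has no partition that groups each event's rows together, and the events don't all have the same number of records. With that in mind, the solution I came up with uses the position of each event start to retrieve the corresponding value.

Ordering by timestamp, we extract the position of each row: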

from pyspark.sql import Window
from pyspark.sql.functions import col, rank, collect_list, expr

df = (
  spark.createDataFrame(
    [
        { 'timestamp': '2021-02-02 01:03:55', 'col1': 's1' },
        { 'timestamp': '2021-02-02 01:04:16.952854', 'col1': 's1', 'col2': 'other_ind'},
        { 'timestamp': '2021-02-02 01:04:32.398155', 'col1': 's1'},
        { 'timestamp': '2021-02-02 01:04:53.793089', 'col1': 's1', 'col2': 'event_start_ind', 'col3': 'event_1_value'},
        { 'timestamp': '2021-02-02 01:05:10.936913', 'col1': 's1'},
        { 'timestamp': '2021-02-02 01:05:36', 'col1': 's1', 'col2': 'other_ind'},
        { 'timestamp': '2021-02-02 01:05:42', 'col1': 's1'},
        { 'timestamp': '2021-02-02 01:05:43', 'col1': 's1'},
        { 'timestamp': '2021-02-02 01:05:44', 'col1': 's1', 'col2': 'event_start_ind', 'col3': 'event_2_value'},
        { 'timestamp': '2021-02-02 01:05:46.623198', 'col1': 's1'},
        { 'timestamp': '2021-02-02 01:06:50', 'col1': 's1'},
        { 'timestamp': '2021-02-02 01:07:19.607685', 'col1': 's1'}
    ]
  )
  .withColumn('timestamp', col('timestamp').cast('timestamp'))
  .withColumn("line", rank().over(Window.orderBy("timestamp")))
)

df.show(truncate=False)

+----+--------------------------+---------------+-------------+----+
|col1|timestamp                 |col2           |col3         |line|
+----+--------------------------+---------------+-------------+----+
|s1  |2021-02-02 01:03:55       |null           |null         |1   |
|s1  |2021-02-02 01:04:16.952854|other_ind      |null         |2   |
|s1  |2021-02-02 01:04:32.398155|null           |null         |3   |
|s1  |2021-02-02 01:04:53.793089|event_start_ind|event_1_value|4   |
|s1  |2021-02-02 01:05:10.936913|null           |null         |5   |
|s1  |2021-02-02 01:05:36       |other_ind      |null         |6   |
|s1  |2021-02-02 01:05:42       |null           |null         |7   |
|s1  |2021-02-02 01:05:43       |null           |null         |8   |
|s1  |2021-02-02 01:05:44       |event_start_ind|event_2_value|9   |
|s1  |2021-02-02 01:05:46.623198|null           |null         |10  |
|s1  |2021-02-02 01:06:50       |null           |null         |11  |
|s1  |2021-02-02 01:07:19.607685|null           |null         |12  |
+----+--------------------------+---------------+-------------+----+

Next, we identify the start of each event:

df_event_start = (
    df.filter(col("col2") == 'event_start_ind')
    .select(
        col("line").alias("event_start_line"),
        col("col3").alias("event_value")
    )
)
df_event_start.show()

+----------------+-------------+
|event_start_line|  event_value|
+----------------+-------------+
|               4|event_1_value|
|               9|event_2_value|
+----------------+-------------+

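Use the event_start information to find the next valid event start:

df_with_event_starts = (
    # Cross join: attach the single-row array of event start lines to every row.
    df.crossJoin(
        df_event_start.select(collect_list('event_start_line').alias("event_starts"))
    )
    # array_min does not rely on the order in which collect_list returns the lines,
    # and yields null when no event start is at or after the current line.
    .withColumn("next_valid_event", expr("array_min(filter(event_starts, x -> x >= line))"))
)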

df_with_event_starts.show(truncate=False)

+----+--------------------------+---------------+-------------+----+------------+----------------+
|col1|timestamp                 |col2           |col3         |line|event_starts|next_valid_event|
+----+--------------------------+---------------+-------------+----+------------+----------------+
|s1  |2021-02-02 01:03:55       |null           |null         |1   |[4, 9]      |4               |
|s1  |2021-02-02 01:04:16.952854|other_ind      |null         |2   |[4, 9]      |4               |
|s1  |2021-02-02 01:04:32.398155|null           |null         |3   |[4, 9]      |4               |
|s1  |2021-02-02 01:04:53.793089|event_start_ind|event_1_value|4   |[4, 9]      |4               |
|s1  |2021-02-02 01:05:10.936913|null           |null         |5   |[4, 9]      |9               |
|s1  |2021-02-02 01:05:36       |other_ind      |null         |6   |[4, 9]      |9               |
|s1  |2021-02-02 01:05:42       |null           |null         |7   |[4, 9]      |9               |
|s1  |2021-02-02 01:05:43       |null           |null         |8   |[4, 9]      |9               |
|s1  |2021-02-02 01:05:44       |event_start_ind|event_2_value|9   |[4, 9]      |9               |
|s1  |2021-02-02 01:05:46.623198|null           |null         |10  |[4, 9]      |null            |
|s1  |2021-02-02 01:06:50       |null           |null         |11  |[4, 9]      |null            |
|s1  |2021-02-02 01:07:19.607685|null           |null         |12  |[4, 9]      |null            |
+----+--------------------------+---------------+-------------+----+------------+----------------+
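The `next_valid_event` lookup amounts to finding, for each line number, the smallest event-start line that is greater than or equal to it. A minimal plain-Python sketch of that idea follows; `next_event_start` is a hypothetical helper name used only for illustration, and this is not how Spark evaluates the expression internally.

```python
from bisect import bisect_left


def next_event_start(event_starts, line):
    """Return the smallest event-start line >= line, or None if there is none."""
    starts = sorted(event_starts)
    i = bisect_left(starts, line)
    return starts[i] if i < len(starts) else None


# Event start lines taken from the df_event_start output above.
event_starts = [4, 9]

# Map every line number of the 12-row sample to its next event start.
lookup = {line: next_event_start(event_starts, line) for line in range(1, 13)}
print(lookup)
```

Lines after the last event start have nothing to look up, which is why they end up as null in the Spark output.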

Finally, retrieve the correct value:

(
    df_with_event_starts.join(
        df_event_start,
        col("next_valid_event") == col("event_start_line"),
        how="left"
    )
    .drop("line", "event_starts", "next_valid_event", "event_start_line")
    .show(truncate=False)
)

+----+--------------------------+---------------+-------------+-------------+
|col1|timestamp                 |col2           |col3         |event_value  |
+----+--------------------------+---------------+-------------+-------------+
|s1  |2021-02-02 01:03:55       |null           |null         |event_1_value|
|s1  |2021-02-02 01:04:16.952854|other_ind      |null         |event_1_value|
|s1  |2021-02-02 01:04:32.398155|null           |null         |event_1_value|
|s1  |2021-02-02 01:04:53.793089|event_start_ind|event_1_value|event_1_value|
|s1  |2021-02-02 01:05:10.936913|null           |null         |event_2_value|
|s1  |2021-02-02 01:05:36       |other_ind      |null         |event_2_value|
|s1  |2021-02-02 01:05:42       |null           |null         |event_2_value|
|s1  |2021-02-02 01:05:43       |null           |null         |event_2_value|
|s1  |2021-02-02 01:05:44       |event_start_ind|event_2_value|event_2_value|
|s1  |2021-02-02 01:05:46.623198|null           |null         |null         |
|s1  |2021-02-02 01:06:50       |null           |null         |null         |
|s1  |2021-02-02 01:07:19.607685|null           |null         |null         |
+----+--------------------------+---------------+-------------+-------------+

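This solution will run into problems on large datasets, because the window with no partitionBy moves all rows into a single partition. If you can identify a key for each event, I suggest going back to your initial approach with window functions. In that case, you can try the last or first SQL functions (with null values ignored).

Hopefully someone can help you with a better solution.

Tip: it helps to include the dataframe creation script in the question.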

Solution 1
0 2022-02-04 11:00:17
