
Pyspark retain value across rows?

I have a problem that is naturally solved with a row-by-row SAS approach, but I'm stuck on how to do it in Pyspark. I have a dataset of events for people, ordered by time, for example:

import pandas as pd

test_df = pd.DataFrame({'event_list': [["H"], ["H"], ["H","F"], ["F"], ["F"], ["H"], ["W"], ["W"]], 'time_order': [1,2,3,4,5,6,7,8], 'person': [1,1,1,1,1,1,1,1]})
test_df = spark.createDataFrame(test_df)
test_df.show()

+----------+----------+------+
|event_list|time_order|person|
+----------+----------+------+
|       [H]|         1|     1|
|       [H]|         2|     1|
|    [H, F]|         3|     1|
|       [F]|         4|     1|
|       [F]|         5|     1|
|       [H]|         6|     1|
|       [W]|         7|     1|
|       [W]|         8|     1|
+----------+----------+------+

I want to group these events into episodes, where every event that follows the initial event is part of the initial event list. So in my test_df I would expect 3 episodes:

+----------+----------+------+-------+
|event_list|time_order|person|episode|
+----------+----------+------+-------+
|       [H]|         1|     1|      1|
|       [H]|         2|     1|      1|
|    [H, F]|         3|     1|      2|
|       [F]|         4|     1|      2|
|       [F]|         5|     1|      2|
|       [H]|         6|     1|      2|
|       [W]|         7|     1|      3|
|       [W]|         8|     1|      3|
+----------+----------+------+-------+

In SAS I would retain the prior row's value for event_list, and if the current event_list is contained in the prior event_list, I would retain the current event_list value rather than the prior event_list. E.g. my retained values would be [null, ["H"], ["H"], ["H","F"], ["H","F"], ["H","F"], ["W"]]. Then I can generate the episodes by tracking changes in the retained values.
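
To make that logic concrete, here is a minimal plain-Python sketch of the row-by-row retain idea over the example rows for a single person (the retained/episode names are just illustrative, not SAS or Spark constructs):

# Row-by-row retain logic: start a new episode whenever the current events
# are not all contained in the event list that opened the current episode.
rows = [["H"], ["H"], ["H", "F"], ["F"], ["F"], ["H"], ["W"], ["W"]]

episodes = []
retained = None  # event list that opened the current episode
episode = 0

for event_list in rows:
    if retained is None or not set(event_list).issubset(retained):
        retained = set(event_list)
        episode += 1
    episodes.append(episode)

print(episodes)  # [1, 1, 2, 2, 2, 2, 3, 3]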

In Pyspark I'm not sure how to retain information sequentially across row operations... is this even possible? My attempts using window functions (partitioning by person and ordering by time_order) have failed. How can I solve this problem in Pyspark?

If you are using spark version >= 2.4, use collect_list on the event_list column over a window, flatten the result, remove duplicates using array_distinct, and finally use size to count how many distinct events have occurred over time. It would be something like this:

from pyspark.sql.functions import col, collect_list, flatten, array_distinct, size
from pyspark.sql.window import Window

# Running window per person: all rows from the first up to the current row, ordered by time.
w = Window.partitionBy('person').orderBy('time_order').rowsBetween(Window.unboundedPreceding, 0)

# Count the distinct events seen so far; the count only grows when a row introduces a new event.
test_df = test_df.withColumn('episode', size(array_distinct(flatten(collect_list(col('event_list')).over(w)))))
test_df.show()

+----------+----------+------+-------+
|event_list|time_order|person|episode|
+----------+----------+------+-------+
|       [H]|         1|     1|      1|
|       [H]|         2|     1|      1|
|    [H, F]|         3|     1|      2|
|       [F]|         4|     1|      2|
|       [F]|         5|     1|      2|
|       [H]|         6|     1|      2|
|       [W]|         7|     1|      3|
|       [W]|         8|     1|      3|
+----------+----------+------+-------+
