
How to label consecutive duplicates by a column value with a unique value in PySpark?

I have data in a PySpark DataFrame that looks like this:

| group | row | col |
+-------+-----+-----+
|   1   |  0  |  A  |
|   1   |  1  |  B  |
|   1   |  2  |  B  |
|   1   |  3  |  C  |
|   1   |  4  |  C  |
|   1   |  5  |  C  |
|   2   |  0  |  D  |
|   2   |  1  |  A  |
|   2   |  2  |  A  |
|   2   |  3  |  E  |
|   2   |  4  |  F  |
|   2   |  5  |  G  |
          ...

I would like to add an additional column that gives each "run" of consecutive identical col values within a group, ordered by row, a unique value (could be a string or an int, it doesn't really matter).

A run value choice that makes it easy to see what's happening is the concatenation of the group, start row, end row, and the repeating col value. For the data example above, that would look like:

| group | row | col |   run   |
+-------+-----+-----+---------+
|   1   |  0  |  A  | 1-0-0-A |
|   1   |  1  |  B  | 1-1-2-B |
|   1   |  2  |  B  | 1-1-2-B |
|   1   |  3  |  C  | 1-3-5-C |
|   1   |  4  |  C  | 1-3-5-C |
|   1   |  5  |  C  | 1-3-5-C |
|   2   |  0  |  D  | 2-0-0-D |
|   2   |  1  |  A  | 2-1-2-A |
|   2   |  2  |  A  | 2-1-2-A |
|   2   |  3  |  E  | 2-3-3-E |
|   2   |  4  |  F  | 2-4-4-F |
|   2   |  5  |  G  | 2-5-5-G |
          ...

I've started with window functions to get a boolean demarcation of intervals:

import pyspark.sql.functions as f
from pyspark.sql import Window

win = Window.partitionBy('group').orderBy('row')
df = df.withColumn('next_col', f.lead('col').over(win))
df = df.withColumn('col_same', df['col'] == df['next_col'])

But it seems like I'll have to call f.lag on col_same to get the actual intervals (perhaps into separate columns) and then call another operation to produce the run from these additional columns. I feel like there is likely a simpler and more efficient approach - any suggestions would be appreciated!

You could use lag and lead to find the boundaries where the value of col changes:

from pyspark.sql import Row, Window
# note: importing min/max from pyspark.sql.functions shadows the Python built-ins
from pyspark.sql.functions import col, lag, lead, when, coalesce, lit, concat_ws, min, max

df = spark_session.createDataFrame([
    Row(group=1, row=0, col='A'),
    Row(group=1, row=1, col='B'),
    Row(group=1, row=2, col='B'),
    Row(group=1, row=3, col='C'),
    Row(group=1, row=4, col='C'),
    Row(group=1, row=5, col='C'),
    Row(group=2, row=0, col='D'),
    Row(group=2, row=1, col='A'),
    Row(group=2, row=2, col='A'),
    Row(group=2, row=3, col='E'),
    Row(group=2, row=4, col='F'),
    Row(group=2, row=5, col='G'),
])

win = Window.partitionBy('group').orderBy('row')

# a run starts where col differs from the previous value (lag)
# and ends where col differs from the next value (lead)
df2 = df.withColumn('lag', lag('col').over(win)) \
    .withColumn('lead', lead('col').over(win)) \
    .withColumn('start', when(col('col') != coalesce(col('lag'), lit(-1)), col('row'))) \
    .withColumn('end', when(col('col') != coalesce(col('lead'), lit(-1)), col('row')))

df2.show()

Output:

+---+-----+---+----+----+-----+----+
|col|group|row| lag|lead|start| end|
+---+-----+---+----+----+-----+----+
|  A|    1|  0|null|   B|    0|   0|
|  B|    1|  1|   A|   B|    1|null|
|  B|    1|  2|   B|   C| null|   2|
|  C|    1|  3|   B|   C|    3|null|
|  C|    1|  4|   C|   C| null|null|
|  C|    1|  5|   C|null| null|   5|
|  D|    2|  0|null|   A|    0|   0|
|  A|    2|  1|   D|   A|    1|null|
|  A|    2|  2|   A|   E| null|   2|
|  E|    2|  3|   A|   F|    3|   3|
|  F|    2|  4|   E|   G|    4|   4|
|  G|    2|  5|   F|null|    5|   5|
+---+-----+---+----+----+-----+----+

To get the information into single rows as in the question, you probably need to shuffle again:

# propagate each run's min start / max end to every row sharing the same (group, col)
win2 = Window.partitionBy('group', 'col')
df2.select(col('group'), col('col'), col('row'),
           concat_ws('-', col('group'), min('start').over(win2), max('end').over(win2), col('col')).alias('run'))\
    .orderBy('group', 'row')\
    .show()

Output:

+-----+---+---+-------+
|group|col|row|    run|
+-----+---+---+-------+
|    1|  A|  0|1-0-0-A|
|    1|  B|  1|1-1-2-B|
|    1|  B|  2|1-1-2-B|
|    1|  C|  3|1-3-5-C|
|    1|  C|  4|1-3-5-C|
|    1|  C|  5|1-3-5-C|
|    2|  D|  0|2-0-0-D|
|    2|  A|  1|2-1-2-A|
|    2|  A|  2|2-1-2-A|
|    2|  E|  3|2-3-3-E|
|    2|  F|  4|2-4-4-F|
|    2|  G|  5|2-5-5-G|
+-----+---+---+-------+
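
One point worth noting about the second step: the window is keyed on (group, col), so it implicitly assumes a given col value never occurs in two separate runs within the same group (a sequence like A, B, A would collapse into one run). A cumulative-sum "run id" avoids that assumption: flag the first row of each run, take a running sum of the flags, and then aggregate over (group, run_id) instead of (group, col). Below is a minimal sketch of that idea; it reuses df and the row-ordered window from above, and the is_new_run / run_id / run_win names are purely illustrative:

import pyspark.sql.functions as F
from pyspark.sql import Window

win = Window.partitionBy('group').orderBy('row')

# flag the first row of each run: no previous row, or col differs from the previous row's col
prev = F.lag('col').over(win)
df_runs = df.withColumn('is_new_run', (prev.isNull() | (F.col('col') != prev)).cast('int'))

# running sum of the flags gives a run id that is unique within each group
df_runs = df_runs.withColumn(
    'run_id', F.sum('is_new_run').over(win.rowsBetween(Window.unboundedPreceding, Window.currentRow)))

# per-run start and end rows, then the human-readable label from the question
run_win = Window.partitionBy('group', 'run_id')
df_runs = df_runs.withColumn(
    'run', F.concat_ws('-', F.col('group'),
                       F.min('row').over(run_win), F.max('row').over(run_win), F.col('col')))

df_runs.orderBy('group', 'row').show()

For the sample data this produces the same run labels as above, and it keeps working when a value reappears later in the same group.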
An alternative approach: group by group and col to find each run's first and last row, build the run label from them, and join the result back onto the original DataFrame to recover the row column:

import pyspark.sql.functions as F
from pyspark.sql import Window


df = spark.createDataFrame(
    [[1, 0, "A"], [1, 1, "B"], [1, 2, "B"], [1, 3, "C"], [1, 4, "C"], [1, 5, "C"],
     [2, 0, "D"], [2, 1, "A"], [2, 2, "A"], [2, 3, "E"], [2, 4, "F"], [2, 5, "G"]],
    ["group", "row", "col"])


# group by (group, col) and collect the row numbers of each run;
# keep row as a number so array_min/array_max compare numerically
# (casting to string would compare lexicographically once rows reach 10)
df1 = df.groupBy("group", "col") \
    .agg(F.collect_set(F.col("row")).alias("row_arr")) \
    .select("*", F.array_min("row_arr").alias("min"), F.array_max("row_arr").alias("max"))

# build the "start-end" part of the label; when min == max the run is a single row
df2 = df1.withColumn(
    "arr_str",
    F.when(F.col("min") == F.col("max"), F.concat_ws("-", F.col("min"), F.col("min")))
     .otherwise(F.concat_ws("-", F.col("min").cast("string"), F.col("max").cast("string"))))

# prepend the group and append the col value to get the full run label
df3 = df2.select("group", "col",
                 F.concat_ws("-", F.col("group").cast("string"), F.concat_ws("-", "arr_str", "col")).alias("run"))

# join back to the original DataFrame to recover the row column
df4 = df.select("row", "group", "col").join(df3, ["group", "col"], "inner").distinct()

df4.orderBy("group","row").show()



+-----+---+---+-------+
|group|col|row|    run|
+-----+---+---+-------+
|    1|  A|  0|1-0-0-A|
|    1|  B|  1|1-1-2-B|
|    1|  B|  2|1-1-2-B|
|    1|  C|  3|1-3-5-C|
|    1|  C|  4|1-3-5-C|
|    1|  C|  5|1-3-5-C|
|    2|  D|  0|2-0-0-D|
|    2|  A|  1|2-1-2-A|
|    2|  A|  2|2-1-2-A|
|    2|  E|  3|2-3-3-E|
|    2|  F|  4|2-4-4-F|
|    2|  G|  5|2-5-5-G|
+-----+---+---+-------+
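
The collect_set / array_min / array_max step above can also be written with plain min/max aggregations, which is a little shorter. A compact sketch of the same idea follows; it shares the assumption that a col value does not repeat in separate runs within a group, and df_runs is just an illustrative name:

import pyspark.sql.functions as F

# one row per (group, col) run with its first and last row number
df_runs = (df.groupBy("group", "col")
             .agg(F.min("row").alias("start"), F.max("row").alias("end"))
             .withColumn("run", F.concat_ws("-", "group", "start", "end", "col")))

# attach the label back to every original row
df.join(df_runs.select("group", "col", "run"), ["group", "col"]).orderBy("group", "row").show()

For the sample data this produces the same run column as df4; the distinct() in df4 is only needed if the input itself contains duplicate rows.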
