How to label consecutive duplicates by a column value with a unique value in PySpark?
I have data in a PySpark DataFrame that looks like this:
| group | row | col |
+-------+-----+-----+
|     1 |   0 |  A  |
|     1 |   1 |  B  |
|     1 |   2 |  B  |
|     1 |   3 |  C  |
|     1 |   4 |  C  |
|     1 |   5 |  C  |
|     2 |   0 |  D  |
|     2 |   1 |  A  |
|     2 |   2 |  A  |
|     2 |   3 |  E  |
|     2 |   4 |  F  |
|     2 |   5 |  G  |
...
I would like to add an additional column that gives each "run" of consecutive identical col values within a group, ordered by row, a unique value (could be a string or an int; it doesn't really matter).
A run value that makes it easy to see what's happening is the concatenation of the group, start row, end row, and the repeating col value. For the data example above, that would look like:
| group | row | col | run     |
+-------+-----+-----+---------+
|     1 |   0 |  A  | 1-0-0-A |
|     1 |   1 |  B  | 1-1-2-B |
|     1 |   2 |  B  | 1-1-2-B |
|     1 |   3 |  C  | 1-3-5-C |
|     1 |   4 |  C  | 1-3-5-C |
|     1 |   5 |  C  | 1-3-5-C |
|     2 |   0 |  D  | 2-0-0-D |
|     2 |   1 |  A  | 2-1-2-A |
|     2 |   2 |  A  | 2-1-2-A |
|     2 |   3 |  E  | 2-3-3-E |
|     2 |   4 |  F  | 2-4-4-F |
|     2 |   5 |  G  | 2-5-5-G |
...
I've started with window functions to get a boolean demarcation of intervals:
from pyspark.sql import Window
from pyspark.sql import functions as f

win = Window.partitionBy('group').orderBy('row')
df = df.withColumn('next_col', f.lead('col').over(win))
df = df.withColumn('col_same', df['col'] == df['next_col'])
But it seems like I'll have to call f.lag on col_same to get the actual intervals (perhaps into separate columns), and then apply another operation to produce the run from these additional columns. I feel like there is likely a simpler and more efficient approach; any suggestions would be appreciated!
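For concreteness, here is what the desired labeling computes, sketched in plain Python with `itertools.groupby` (a reference sketch of the expected values, not a PySpark solution):

```python
from itertools import groupby

def label_runs(rows):
    """rows: (group, row, col) tuples already sorted by (group, row).
    Returns one 'group-start-end-col' label per input row."""
    labels = []
    # consecutive rows with the same (group, col) pair form one run
    for (g, c), run in groupby(rows, key=lambda r: (r[0], r[2])):
        run = list(run)
        start, end = run[0][1], run[-1][1]
        labels.extend(f"{g}-{start}-{end}-{c}" for _ in run)
    return labels

data = [(1, 0, 'A'), (1, 1, 'B'), (1, 2, 'B'), (1, 3, 'C'), (1, 4, 'C'), (1, 5, 'C'),
        (2, 0, 'D'), (2, 1, 'A'), (2, 2, 'A'), (2, 3, 'E'), (2, 4, 'F'), (2, 5, 'G')]
print(label_runs(data)[:4])  # ['1-0-0-A', '1-1-2-B', '1-1-2-B', '1-3-5-C']
```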
You could use lag and lead to find the boundaries where the value of col changes:
from pyspark.sql import Row, Window
from pyspark.sql.functions import coalesce, col, lag, lead, lit, when

df = spark_session.createDataFrame([
    Row(group=1, row=0, col='A'),
    Row(group=1, row=1, col='B'),
    Row(group=1, row=2, col='B'),
    Row(group=1, row=3, col='C'),
    Row(group=1, row=4, col='C'),
    Row(group=1, row=5, col='C'),
    Row(group=2, row=0, col='D'),
    Row(group=2, row=1, col='A'),
    Row(group=2, row=2, col='A'),
    Row(group=2, row=3, col='E'),
    Row(group=2, row=4, col='F'),
    Row(group=2, row=5, col='G'),
])

win = Window.partitionBy('group').orderBy('row')
# 'start' holds the row number where a run begins, 'end' where it ends;
# coalesce supplies a sentinel so the first/last row of each group
# also counts as a boundary
df2 = df.withColumn('lag', lag('col').over(win)) \
    .withColumn('lead', lead('col').over(win)) \
    .withColumn('start', when(col('col') != coalesce(col('lag'), lit(-1)), col('row'))) \
    .withColumn('end', when(col('col') != coalesce(col('lead'), lit(-1)), col('row')))
df2.show()
Output:
+---+-----+---+----+----+-----+----+
|col|group|row| lag|lead|start| end|
+---+-----+---+----+----+-----+----+
| A| 1| 0|null| B| 0| 0|
| B| 1| 1| A| B| 1|null|
| B| 1| 2| B| C| null| 2|
| C| 1| 3| B| C| 3|null|
| C| 1| 4| C| C| null|null|
| C| 1| 5| C|null| null| 5|
| D| 2| 0|null| A| 0| 0|
| A| 2| 1| D| A| 1|null|
| A| 2| 2| A| E| null| 2|
| E| 2| 3| A| F| 3| 3|
| F| 2| 4| E| G| 4| 4|
| G| 2| 5| F|null| 5| 5|
+---+-----+---+----+----+-----+----+
To get the information into single rows as in the question, you probably need to shuffle again:
# note: partitioning by (group, col) assumes each col value forms
# only one consecutive run within a group
win2 = Window.partitionBy('group', 'col')
df2.select(col('group'), col('col'), col('row'),
           concat_ws('-', col('group'), min('start').over(win2),
                     max('end').over(win2), col('col')).alias('run')) \
    .orderBy('group', 'row') \
    .show()
Output:
+-----+---+---+-------+
|group|col|row| run|
+-----+---+---+-------+
| 1| A| 0|1-0-0-A|
| 1| B| 1|1-1-2-B|
| 1| B| 2|1-1-2-B|
| 1| C| 3|1-3-5-C|
| 1| C| 4|1-3-5-C|
| 1| C| 5|1-3-5-C|
| 2| D| 0|2-0-0-D|
| 2| A| 1|2-1-2-A|
| 2| A| 2|2-1-2-A|
| 2| E| 3|2-3-3-E|
| 2| F| 4|2-4-4-F|
| 2| G| 5|2-5-5-G|
+-----+---+---+-------+
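As an aside, a common alternative to pairing lag and lead is to compute a single change flag and take its cumulative sum over the window, yielding an integer run id per group (in PySpark this would be roughly `f.sum((f.col('col') != f.lag('col').over(win)).cast('int'))` over the same window; names as in the question, not from the answer above). The core idea, sketched in plain Python:

```python
def run_ids(cols):
    """Assign an increasing run id: increment whenever the value differs
    from the previous element (a cumulative sum of change flags)."""
    ids, current = [], -1
    prev = object()  # sentinel that equals nothing in the data
    for c in cols:
        if c != prev:
            current += 1  # change flag is 1: a new run starts here
        ids.append(current)
        prev = c
    return ids

print(run_ids(['A', 'B', 'B', 'C', 'C', 'C']))  # [0, 1, 1, 2, 2, 2]
```

Partitioning by (group, run id) then gives min('row') and max('row') for each run directly, and stays correct even when the same col value appears in more than one run within a group.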
import pyspark.sql.functions as F

df = spark.createDataFrame(
    [[1, 0, "A"], [1, 1, "B"], [1, 2, "B"], [1, 3, "C"], [1, 4, "C"], [1, 5, "C"],
     [2, 0, "D"], [2, 1, "A"], [2, 2, "A"], [2, 3, "E"], [2, 4, "F"], [2, 5, "G"]],
    ["group", "row", "col"])

# collect the rows for each (group, col) pair and take the numeric min and max
# (keeping rows numeric avoids lexicographic surprises once row reaches 10)
df1 = df.groupBy("group", "col") \
    .agg(F.collect_set("row").alias("row_arr")) \
    .select("*", F.array_min("row_arr").alias("min"), F.array_max("row_arr").alias("max"))
# build the "start-end" part of the run string
df2 = df1.withColumn("arr_str", F.concat_ws("-", F.col("min"), F.col("max")))
# prepend the group and append the col value
df3 = df2.select("group", "col",
                 F.concat_ws("-", F.col("group"), "arr_str", "col").alias("run"))
# join back to the original dataframe to recover the row column
df4 = df.select("row", "group", "col").join(df3, ["group", "col"], "inner").distinct()
df4.orderBy("group", "row").show()
+-----+---+---+-------+
|group|col|row|    run|
+-----+---+---+-------+
| 1| A| 0|1-0-0-A|
| 1| B| 1|1-1-2-B|
| 1| B| 2|1-1-2-B|
| 1| C| 3|1-3-5-C|
| 1| C| 4|1-3-5-C|
| 1| C| 5|1-3-5-C|
| 2| D| 0|2-0-0-D|
| 2| A| 1|2-1-2-A|
| 2| A| 2|2-1-2-A|
| 2| E| 3|2-3-3-E|
| 2| F| 4|2-4-4-F|
| 2| G| 5|2-5-5-G|
+-----+---+---+-------+
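One caveat that applies to both answers: grouping or partitioning by (group, col) assumes each col value forms at most one consecutive run per group. If a value recurs non-consecutively (e.g. A, B, B, A), its two runs collapse into one span. A quick plain-Python illustration of the difference:

```python
from itertools import groupby

def runs_by_value(cols):
    """Spans computed per distinct value, as grouping by (group, col) does."""
    spans = {}
    for i, c in enumerate(cols):
        lo, hi = spans.get(c, (i, i))
        spans[c] = (min(lo, i), max(hi, i))
    return spans

def runs_consecutive(cols):
    """Spans computed per consecutive run, the labeling the question asks for."""
    out, i = [], 0
    for c, grp in groupby(cols):
        n = len(list(grp))
        out.append((c, i, i + n - 1))
        i += n
    return out

cols = ['A', 'B', 'B', 'A']
print(runs_by_value(cols))     # {'A': (0, 3), 'B': (1, 2)} -- the two A runs merge
print(runs_consecutive(cols))  # [('A', 0, 0), ('B', 1, 2), ('A', 3, 3)]
```

For the sample data in the question each value happens to occur in a single run per group, so both answers produce the expected output there.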