
How to label consecutive duplicates by a column value with a unique value in PySpark?

I have data in a PySpark DataFrame that looks like this:

| group | row | col |
+-------+-----+-----+
|   1   |  0  |  A  |
|   1   |  1  |  B  |
|   1   |  2  |  B  |
|   1   |  3  |  C  |
|   1   |  4  |  C  |
|   1   |  5  |  C  |
|   2   |  0  |  D  |
|   2   |  1  |  A  |
|   2   |  2  |  A  |
|   2   |  3  |  E  |
|   2   |  4  |  F  |
|   2   |  5  |  G  |
          ...

I would like to add an additional column that gives each "run" of consecutive identical col values within a group, ordered by row, a unique value (could be a string or an int, it doesn't really matter).

A run value choice that makes it easy to see what's happening is the concatenation of the group, start row, end row, and the repeating col value. For the data example above, that would look like:

| group | row | col |   run   |
+-------+-----+-----+---------+
|   1   |  0  |  A  | 1-0-0-A |
|   1   |  1  |  B  | 1-1-2-B |
|   1   |  2  |  B  | 1-1-2-B |
|   1   |  3  |  C  | 1-3-5-C |
|   1   |  4  |  C  | 1-3-5-C |
|   1   |  5  |  C  | 1-3-5-C |
|   2   |  0  |  D  | 2-0-0-D |
|   2   |  1  |  A  | 2-1-2-A |
|   2   |  2  |  A  | 2-1-2-A |
|   2   |  3  |  E  | 2-3-3-E |
|   2   |  4  |  F  | 2-4-4-F |
|   2   |  5  |  G  | 2-5-5-G |
          ...

I've started with window functions to get a boolean demarcation of intervals:

import pyspark.sql.functions as f
from pyspark.sql import Window

win = Window.partitionBy('group').orderBy('row')
df = df.withColumn('next_col', f.lead('col').over(win))
df = df.withColumn('col_same', df['col'] == df['next_col'])

But it seems like I'll have to call f.lag on col_same to get the actual intervals (perhaps into separate columns) and then call another operation to produce the run from these additional columns. I feel like there is likely a simpler and more efficient approach - any suggestions would be appreciated!

You could use lag and lead to find the boundaries where the value of col changes:

from pyspark.sql import Row, Window
# note: importing min/max from pyspark.sql.functions shadows the Python built-ins
from pyspark.sql.functions import col, lag, lead, when, coalesce, lit, concat_ws, min, max

df = spark_session.createDataFrame([
    Row(group=1, row=0, col='A'),
    Row(group=1, row=1, col='B'),
    Row(group=1, row=2, col='B'),
    Row(group=1, row=3, col='C'),
    Row(group=1, row=4, col='C'),
    Row(group=1, row=5, col='C'),
    Row(group=2, row=0, col='D'),
    Row(group=2, row=1, col='A'),
    Row(group=2, row=2, col='A'),
    Row(group=2, row=3, col='E'),
    Row(group=2, row=4, col='F'),
    Row(group=2, row=5, col='G'),
])

win = Window.partitionBy('group').orderBy('row')

# a run starts where col differs from the previous value (lag)
# and ends where col differs from the next value (lead)
df2 = df.withColumn('lag', lag('col').over(win)) \
    .withColumn('lead', lead('col').over(win)) \
    .withColumn('start', when(col('col') != coalesce(col('lag'), lit(-1)), col('row'))) \
    .withColumn('end', when(col('col') != coalesce(col('lead'), lit(-1)), col('row')))

df2.show()

Output:

+---+-----+---+----+----+-----+----+
|col|group|row| lag|lead|start| end|
+---+-----+---+----+----+-----+----+
|  A|    1|  0|null|   B|    0|   0|
|  B|    1|  1|   A|   B|    1|null|
|  B|    1|  2|   B|   C| null|   2|
|  C|    1|  3|   B|   C|    3|null|
|  C|    1|  4|   C|   C| null|null|
|  C|    1|  5|   C|null| null|   5|
|  D|    2|  0|null|   A|    0|   0|
|  A|    2|  1|   D|   A|    1|null|
|  A|    2|  2|   A|   E| null|   2|
|  E|    2|  3|   A|   F|    3|   3|
|  F|    2|  4|   E|   G|    4|   4|
|  G|    2|  5|   F|null|    5|   5|
+---+-----+---+----+----+-----+----+

To get the information into single rows as in the question, you probably need to shuffle again:

# propagate each run's min start / max end to every row sharing the same (group, col)
win2 = Window.partitionBy('group', 'col')
df2.select(col('group'), col('col'), col('row'),
           concat_ws('-', col('group'), min('start').over(win2), max('end').over(win2), col('col')).alias('run'))\
    .orderBy('group', 'row')\
    .show()

Output:

+-----+---+---+-------+
|group|col|row|    run|
+-----+---+---+-------+
|    1|  A|  0|1-0-0-A|
|    1|  B|  1|1-1-2-B|
|    1|  B|  2|1-1-2-B|
|    1|  C|  3|1-3-5-C|
|    1|  C|  4|1-3-5-C|
|    1|  C|  5|1-3-5-C|
|    2|  D|  0|2-0-0-D|
|    2|  A|  1|2-1-2-A|
|    2|  A|  2|2-1-2-A|
|    2|  E|  3|2-3-3-E|
|    2|  F|  4|2-4-4-F|
|    2|  G|  5|2-5-5-G|
+-----+---+---+-------+
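
One point worth noting about the second step: the window is keyed on (group, col), so it implicitly assumes a given col value never occurs in two separate runs within the same group (a sequence like A, B, A would collapse into one run). A cumulative-sum "run id" avoids that assumption: flag the first row of each run, take a running sum of the flags, and then aggregate over (group, run_id) instead of (group, col). Below is a minimal sketch of that idea; it reuses df and the row-ordered window from above, and the is_new_run / run_id / run_win names are purely illustrative:

import pyspark.sql.functions as F
from pyspark.sql import Window

win = Window.partitionBy('group').orderBy('row')

# flag the first row of each run: no previous row, or col differs from the previous row's col
prev = F.lag('col').over(win)
df_runs = df.withColumn('is_new_run', (prev.isNull() | (F.col('col') != prev)).cast('int'))

# running sum of the flags gives a run id that is unique within each group
df_runs = df_runs.withColumn(
    'run_id', F.sum('is_new_run').over(win.rowsBetween(Window.unboundedPreceding, Window.currentRow)))

# per-run start and end rows, then the human-readable label from the question
run_win = Window.partitionBy('group', 'run_id')
df_runs = df_runs.withColumn(
    'run', F.concat_ws('-', F.col('group'),
                       F.min('row').over(run_win), F.max('row').over(run_win), F.col('col')))

df_runs.orderBy('group', 'row').show()

For the sample data this produces the same run labels as above, and it keeps working when a value reappears later in the same group.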
An alternative approach: group by group and col to find each run's first and last row, build the run label from them, and join the result back onto the original DataFrame to recover the row column:

import pyspark.sql.functions as F
from pyspark.sql import Window


df = spark.createDataFrame(
    [[1, 0, "A"], [1, 1, "B"], [1, 2, "B"], [1, 3, "C"], [1, 4, "C"], [1, 5, "C"],
     [2, 0, "D"], [2, 1, "A"], [2, 2, "A"], [2, 3, "E"], [2, 4, "F"], [2, 5, "G"]],
    ["group", "row", "col"])


# group by (group, col) and collect the row numbers of each run;
# keep row as a number so array_min/array_max compare numerically
# (casting to string would compare lexicographically once rows reach 10)
df1 = df.groupBy("group", "col") \
    .agg(F.collect_set(F.col("row")).alias("row_arr")) \
    .select("*", F.array_min("row_arr").alias("min"), F.array_max("row_arr").alias("max"))

# build the "start-end" part of the label; when min == max the run is a single row
df2 = df1.withColumn(
    "arr_str",
    F.when(F.col("min") == F.col("max"), F.concat_ws("-", F.col("min"), F.col("min")))
     .otherwise(F.concat_ws("-", F.col("min").cast("string"), F.col("max").cast("string"))))

# prepend the group and append the col value to get the full run label
df3 = df2.select("group", "col",
                 F.concat_ws("-", F.col("group").cast("string"), F.concat_ws("-", "arr_str", "col")).alias("run"))

# join back to the original DataFrame to recover the row column
df4 = df.select("row", "group", "col").join(df3, ["group", "col"], "inner").distinct()

df4.orderBy("group","row").show()



+-----+---+---+-------+
|group|col|row|    run|
+-----+---+---+-------+
|    1|  A|  0|1-0-0-A|
|    1|  B|  1|1-1-2-B|
|    1|  B|  2|1-1-2-B|
|    1|  C|  3|1-3-5-C|
|    1|  C|  4|1-3-5-C|
|    1|  C|  5|1-3-5-C|
|    2|  D|  0|2-0-0-D|
|    2|  A|  1|2-1-2-A|
|    2|  A|  2|2-1-2-A|
|    2|  E|  3|2-3-3-E|
|    2|  F|  4|2-4-4-F|
|    2|  G|  5|2-5-5-G|
+-----+---+---+-------+
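
The collect_set / array_min / array_max step above can also be written with plain min/max aggregations, which is a little shorter. A compact sketch of the same idea follows; it shares the assumption that a col value does not repeat in separate runs within a group, and df_runs is just an illustrative name:

import pyspark.sql.functions as F

# one row per (group, col) run with its first and last row number
df_runs = (df.groupBy("group", "col")
             .agg(F.min("row").alias("start"), F.max("row").alias("end"))
             .withColumn("run", F.concat_ws("-", "group", "start", "end", "col")))

# attach the label back to every original row
df.join(df_runs.select("group", "col", "run"), ["group", "col"]).orderBy("group", "row").show()

For the sample data this produces the same run column as df4; the distinct() in df4 is only needed if the input itself contains duplicate rows.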
