How do I count consecutive values with pyspark?
I am trying to count consecutive values that appear in a column with PySpark. I have the column "a" in my dataframe and expect to create the column "b":
+---+---+
| a| b|
+---+---+
| 0| 1|
| 0| 2|
| 0| 3|
| 0| 4|
| 0| 5|
| 1| 1|
| 1| 2|
| 1| 3|
| 1| 4|
| 1| 5|
| 1| 6|
| 2| 1|
| 2| 2|
| 2| 3|
| 2| 4|
| 2| 5|
| 2| 6|
| 3| 1|
| 3| 2|
| 3| 3|
+---+---+
I have tried to create the column "b" with the lag function over a window, but without success:
from pyspark.sql import functions as f
from pyspark.sql.window import Window

w = Window \
    .partitionBy(df.some_id) \
    .orderBy(df.timestamp_column)

# This resets to 0 whenever "a" changes, but f.sum(f.lit(1)).over(w) is a
# running count from the start of the partition, not from the start of the
# current run, so the numbering never restarts at 1.
df.withColumn(
    "b",
    f.when(df.a == f.lag(df.a).over(w),
           f.sum(f.lit(1)).over(w)).otherwise(f.lit(0)),
)
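For context, one standard way to make a lag-based approach work (not the approach this question settles on below) is to flag the rows where "a" changes, turn those flags into a run id with a cumulative sum, and then number the rows within each run. A minimal sketch, assuming the same timestamp_column defines the order; note the unpartitioned window moves all rows to a single partition, which is only acceptable for small data:

w = Window.orderBy("timestamp_column")

result = (
    # 1 where "a" differs from the previous row, 0 otherwise (and 0 for
    # the first row, where lag() returns null).
    df.withColumn(
        "change",
        f.coalesce((f.col("a") != f.lag("a").over(w)).cast("int"), f.lit(0)),
    )
    # Cumulatively summing the flags yields a distinct id per run.
    .withColumn("run_id", f.sum("change").over(w))
    # row_number() within each run produces the desired column "b".
    .withColumn(
        "b",
        f.row_number().over(
            Window.partitionBy("run_id").orderBy("timestamp_column")
        ),
    )
)

Unlike partitioning by "a" directly, this keeps working even if the same value of "a" reappears later in the data.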
I was able to resolve this issue with the following code:
df.withColumn("b",
f.row_number().over(Window.partitionBy("a").orderBy("timestamp_column"))
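Put together, a minimal end-to-end sketch of that solution (the sample data and the timestamp_column name are assumptions matching the question):

from pyspark.sql import SparkSession, functions as f
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Rebuild the sample data: "a" as in the question, with an increasing
# timestamp_column to fix the row order (hypothetical column name).
values = [0] * 5 + [1] * 6 + [2] * 6 + [3] * 3
df = spark.createDataFrame(
    [(i, v) for i, v in enumerate(values)], ["timestamp_column", "a"]
)

# row_number() restarts at 1 in every partition, so partitioning by "a"
# numbers the rows of each group 1, 2, 3, ...
w = Window.partitionBy("a").orderBy("timestamp_column")
df.withColumn("b", f.row_number().over(w)).orderBy("timestamp_column").show()

Note that partitionBy("a") groups every row with the same value of "a", so this numbering is only truly consecutive when each value of "a" occurs in one unbroken run, as it does in the example; if the same value could reappear later, the run-id sketch above is the safer choice.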