How to count consecutive values with PySpark?

Question

I am trying to count consecutive values that appear in a column with PySpark. My dataframe has the column "a", and I want to create the column "b" from it:

+---+---+
|  a|  b|
+---+---+
|  0|  1|
|  0|  2|
|  0|  3|
|  0|  4|
|  0|  5|
|  1|  1|
|  1|  2|
|  1|  3|
|  1|  4|
|  1|  5|
|  1|  6|
|  2|  1|
|  2|  2|
|  2|  3|
|  2|  4|
|  2|  5|
|  2|  6|
|  3|  1|
|  3|  2|
|  3|  3|
+---+---+

I tried to create the column "b" using the lag function over a window, but without success:

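from pyspark.sql import Window
import pyspark.sql.functions as f

# Window over each id, ordered by time
w = Window\
  .partitionBy(df.some_id)\
  .orderBy(df.timestamp_column)

# Attempt: count rows while "a" equals the previous row's value
df.withColumn(
  "b",
  f.when(df.a == f.lag(df.a).over(w),
         f.sum(f.lit(1)).over(w)).otherwise(f.lit(0))
)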

Answer 1

I resolved this issue with the following code:

# Number the rows within each value of "a", ordered by time
df.withColumn("b",
  f.row_number().over(Window.partitionBy("a").orderBy("timestamp_column")))

Answered 2020-05-13 17:48:18