
PySpark - Sequential Counts with Pandas UDF

I have a dataset that has increasing values in some columns for the same month, and then resets for the next month.

+----------+------+-----------+----+-----------+------------+
|      Date|column|column_2   |co_3|column_4   |column_5    |
+----------+------+-----------+----+-----------+------------+
|2016-12-14|     0|          0|   0|         14|           0|
|2016-12-14|     0|          0|   0|         14|           0|
|2016-12-14|     0|          0|   0|         18|           0|
|2016-12-14|     0|          0|   0|         19|           0|
|2016-12-14|     0|          0|   0|         20|           0|
|2016-12-14|     0|          0|   0|         26|           0|
|2016-12-14|     0|          0|   0|         60|           0|
|2016-12-14|     0|          0|   0|         63|           0|
|2016-12-14|     0|          0|   0|         78|           0|
|2016-12-14|     0|          0|   0|         90|           0|
+----------+------+-----------+----+-----------+------------+

The problem is that their date is always the same, so I want to do some sort of counting up, and then reset the count when we reach a different day.

I've written a Pandas UDF function:

import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf('int', PandasUDFType.SCALAR)
def get_counts_up(v):
    # Count up from 0 within each run of equal values,
    # resetting the counter whenever the value changes.
    prev = None
    series = []
    count = 0
    for i in v:
        if prev != i:
            count = 0
            prev = i
        series.append(count)
        count += 1
    return pd.Series(series)
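
For reference, a minimal sketch of how the UDF might be applied to produce the Date_Count column shown below (the choice of Date as the input column is an assumption, since the question does not show this step):

# Hypothetical application of the UDF; the column actually fed to it is an
# assumption, since the question only shows the resulting Date_Count column.
sdf = sdf.withColumn("Date_Count", get_counts_up(sdf.Date))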

However, the output does not seem to be continuous:

sdf.filter(sdf.Date == "2016-12-14").sort("Date_Count").show()

+------------+----------+------+-----------+----+-----------+------------+---------+----------+--------+----------+-----+----------+
|Date_Convert|      Date|column|column_2   |co_3|column_4   |column_5    |Date_Year|Date_Month|Date_Day|Date_Epoch|count|Date_Count|
+------------+----------+------+-----------+----+-----------+------------+---------+----------+--------+----------+-----+----------+
|  2016-12-14|2016-12-14|     0|          0|   0|         14|           0|     2016|        12|      14|1481673600|14504|         0|
|  2016-12-14|2016-12-14|     0|          0|   0|         18|           0|     2016|        12|      14|1481673600|14504|         0|
|  2016-12-14|2016-12-14|     0|          0|   0|         14|           0|     2016|        12|      14|1481673600|14504|         1|
|  2016-12-14|2016-12-14|     0|          0|   0|         18|           0|     2016|        12|      14|1481673600|14504|         1|
|  2016-12-14|2016-12-14|     0|          0|   0|         18|           0|     2016|        12|      14|1481673600|14504|         2|
|  2016-12-14|2016-12-14|     0|          0|   0|         14|           0|     2016|        12|      14|1481673600|14504|         2|
|  2016-12-14|2016-12-14|     0|          0|   0|         14|           0|     2016|        12|      14|1481673600|14504|         3|
+------------+----------+------+-----------+----+-----------+------------+---------+----------+--------+----------+-----+----------+

Which is to be expected, because I guess the dataframe is split across different machines (a few on Databricks' community edition), and each has its own array to maintain.
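
That guess matches how scalar pandas UDFs work: Spark hands them the data in independent Arrow record batches, so each call to get_counts_up starts its own count. A minimal sketch that makes the batch boundaries easier to see by shrinking the batch size (the spark session variable and the choice of Date as input are assumptions):

# Each scalar pandas UDF invocation only sees one Arrow batch, so shrinking
# the batch size makes the per-batch resets show up even more often.
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "100")

sdf.withColumn("Date_Count", get_counts_up(sdf.Date)) \
   .filter(sdf.Date == "2016-12-14") \
   .show()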

Is there a way to perform a sequential counting up?

A combination of the Window and row_number functions should solve it for you. I have used all of the columns for ordering, as you've said:

dataset that has increasing values in some columns for the same month...

but you can use just one column, or several, whichever hold the increasing values (a sketch using a single column follows the example output below).

from pyspark.sql import window as w

# Partition by Date so the count restarts every day, and order by the value
# columns so the numbering follows the increasing values.
windowSpec = w.Window.partitionBy("Date").orderBy("column", "column_2", "co_3", "column_4", "column_5")

from pyspark.sql import functions as f
df.withColumn('inc_count', f.row_number().over(windowSpec)).show(truncate=False)

which should give you

+----------+------+--------+----+--------+--------+---------+
|Date      |column|column_2|co_3|column_4|column_5|inc_count|
+----------+------+--------+----+--------+--------+---------+
|2016-12-14|0     |0       |0   |14      |0       |1        |
|2016-12-14|0     |0       |0   |14      |0       |2        |
|2016-12-14|0     |0       |0   |18      |0       |3        |
|2016-12-14|0     |0       |0   |19      |0       |4        |
|2016-12-14|0     |0       |0   |20      |0       |5        |
|2016-12-14|0     |0       |0   |26      |0       |6        |
|2016-12-14|0     |0       |0   |60      |0       |7        |
|2016-12-14|0     |0       |0   |63      |0       |8        |
|2016-12-14|0     |0       |0   |78      |0       |9        |
|2016-12-14|0     |0       |0   |90      |0       |10       |
+----------+------+--------+----+--------+--------+---------+
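
As mentioned above, if only one column actually carries the increasing values (column_4 in this sample), ordering by just that column is enough. A minimal sketch, reusing the w and f imports from the snippet above:

# Same idea, ordering only by the single increasing column. Rows that tie on
# column_4 within a day still get distinct numbers, but in an arbitrary order.
windowSpec_single = w.Window.partitionBy("Date").orderBy("column_4")

df.withColumn('inc_count', f.row_number().over(windowSpec_single)).show(truncate=False)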
