
PySpark - Sequential Counts with Pandas UDF

I have a dataset that has increasing values in some columns for the same month, and then resets for the next month.

+----------+------+-----------+----+-----------+------------+
|      Date|column|column_2   |co_3|column_4   |column_5    |
+----------+------+-----------+----+-----------+------------+
|2016-12-14|     0|          0|   0|         14|           0|
|2016-12-14|     0|          0|   0|         14|           0|
|2016-12-14|     0|          0|   0|         18|           0|
|2016-12-14|     0|          0|   0|         19|           0|
|2016-12-14|     0|          0|   0|         20|           0|
|2016-12-14|     0|          0|   0|         26|           0|
|2016-12-14|     0|          0|   0|         60|           0|
|2016-12-14|     0|          0|   0|         63|           0|
|2016-12-14|     0|          0|   0|         78|           0|
|2016-12-14|     0|          0|   0|         90|           0|
+----------+------+-----------+----+-----------+------------+

The problem is that their date is always the same, so I want to do some sort of counting up, and then reset the count when we reach a different day.

I've written a Pandas UDF function:

import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf('int', PandasUDFType.SCALAR)
def get_counts_up(v):
    # Count up from 0 within each run of equal values,
    # resetting the counter whenever the value changes.
    prev = None
    series = []
    count = 0
    for i in v:
        if prev != i:
            count = 0
            prev = i
        series.append(count)
        count += 1
    return pd.Series(series)
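
For reference, a minimal sketch of how the UDF might be applied to produce the Date_Count column shown below (the choice of Date as the input column is an assumption, since the question does not show this step):

# Hypothetical application of the UDF; the column actually fed to it is an
# assumption, since the question only shows the resulting Date_Count column.
sdf = sdf.withColumn("Date_Count", get_counts_up(sdf.Date))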

However, the output does not seem to be continuous:

sdf.filter(sdf.Date == "2016-12-14").sort("Date_Count").show()

+------------+----------+------+-----------+----+-----------+------------+---------+----------+--------+----------+-----+----------+
|Date_Convert|      Date|column|column_2   |co_3|column_4   |column_5    |Date_Year|Date_Month|Date_Day|Date_Epoch|count|Date_Count|
+------------+----------+------+-----------+----+-----------+------------+---------+----------+--------+----------+-----+----------+
|  2016-12-14|2016-12-14|     0|          0|   0|         14|           0|     2016|        12|      14|1481673600|14504|         0|
|  2016-12-14|2016-12-14|     0|          0|   0|         18|           0|     2016|        12|      14|1481673600|14504|         0|
|  2016-12-14|2016-12-14|     0|          0|   0|         14|           0|     2016|        12|      14|1481673600|14504|         1|
|  2016-12-14|2016-12-14|     0|          0|   0|         18|           0|     2016|        12|      14|1481673600|14504|         1|
|  2016-12-14|2016-12-14|     0|          0|   0|         18|           0|     2016|        12|      14|1481673600|14504|         2|
|  2016-12-14|2016-12-14|     0|          0|   0|         14|           0|     2016|        12|      14|1481673600|14504|         2|
|  2016-12-14|2016-12-14|     0|          0|   0|         14|           0|     2016|        12|      14|1481673600|14504|         3|
+------------+----------+------+-----------+----+-----------+------------+---------+----------+--------+----------+-----+----------+

Which is to be expected, because I guess the dataframe is split across different machines (a few on Databricks' community edition), and each has its own array to maintain.
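
That guess matches how scalar pandas UDFs work: Spark hands them the data in independent Arrow record batches, so each call to get_counts_up starts its own count. A minimal sketch that makes the batch boundaries easier to see by shrinking the batch size (the spark session variable and the choice of Date as input are assumptions):

# Each scalar pandas UDF invocation only sees one Arrow batch, so shrinking
# the batch size makes the per-batch resets show up even more often.
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "100")

sdf.withColumn("Date_Count", get_counts_up(sdf.Date)) \
   .filter(sdf.Date == "2016-12-14") \
   .show()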

Is there a way to perform a sequential counting up?

A combination of the Window and row_number functions should solve it for you. I have used all of the columns for ordering, as you've said:

dataset that has increasing values in some columns for the same month...

but you can use just one column, or several, whichever hold the increasing values (a sketch using a single column follows the example output below).

from pyspark.sql import window as w

# Partition by Date so the count restarts every day, and order by the value
# columns so the numbering follows the increasing values.
windowSpec = w.Window.partitionBy("Date").orderBy("column", "column_2", "co_3", "column_4", "column_5")

from pyspark.sql import functions as f
df.withColumn('inc_count', f.row_number().over(windowSpec)).show(truncate=False)

which should give you

+----------+------+--------+----+--------+--------+---------+
|Date      |column|column_2|co_3|column_4|column_5|inc_count|
+----------+------+--------+----+--------+--------+---------+
|2016-12-14|0     |0       |0   |14      |0       |1        |
|2016-12-14|0     |0       |0   |14      |0       |2        |
|2016-12-14|0     |0       |0   |18      |0       |3        |
|2016-12-14|0     |0       |0   |19      |0       |4        |
|2016-12-14|0     |0       |0   |20      |0       |5        |
|2016-12-14|0     |0       |0   |26      |0       |6        |
|2016-12-14|0     |0       |0   |60      |0       |7        |
|2016-12-14|0     |0       |0   |63      |0       |8        |
|2016-12-14|0     |0       |0   |78      |0       |9        |
|2016-12-14|0     |0       |0   |90      |0       |10       |
+----------+------+--------+----+--------+--------+---------+
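
As mentioned above, if only one column actually carries the increasing values (column_4 in this sample), ordering by just that column is enough. A minimal sketch, reusing the w and f imports from the snippet above:

# Same idea, ordering only by the single increasing column. Rows that tie on
# column_4 within a day still get distinct numbers, but in an arbitrary order.
windowSpec_single = w.Window.partitionBy("Date").orderBy("column_4")

df.withColumn('inc_count', f.row_number().over(windowSpec_single)).show(truncate=False)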
