PySpark: How to lead from a specific column value in a DataFrame
The dataframe is already sorted by date. The col1 == 1 value is unique, and only the 0 values have duplicates.
I have a dataframe that looks like this; call it df:
+----------+----+----+
|      date|col1|col2|
+----------+----+----+
|2020-08-01|   5|  -1|
|2020-08-02|   4|  -1|
|2020-08-03|   3|   3|
|2020-08-04|   2|   2|
|2020-08-05|   1|   4|
|2020-08-06|   0|   1|
|2020-08-07|   0|   2|
|2020-08-08|   0|   3|
|2020-08-09|   0|  -1|
+----------+----+----+
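For reference, here is the sample data above as a reproducible dataframe (a minimal sketch; dates are kept as plain strings for brevity):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# sample data from the table above
df = spark.createDataFrame(
    [("2020-08-01", 5, -1), ("2020-08-02", 4, -1), ("2020-08-03", 3, 3),
     ("2020-08-04", 2, 2), ("2020-08-05", 1, 4), ("2020-08-06", 0, 1),
     ("2020-08-07", 0, 2), ("2020-08-08", 0, 3), ("2020-08-09", 0, -1)],
    ["date", "col1", "col2"],
)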
The condition is: when col1 == 1, we take that row's col2 value (here col2 == 4) and count up from it going backwards in time (e.g. 4, 5, 6, 7, 8, ...), and every row after the col1 == 1 row gets 0 (e.g. 4, 0, 0, 0, 0, ...).
So the resulting df would look something like this:
+----------+----+----+----+
|      date|col1|col2|want|
+----------+----+----+----+
|2020-08-01|   5|  -1|   8|
|2020-08-02|   4|  -1|   7|
|2020-08-03|   3|   3|   6|
|2020-08-04|   2|   2|   5|
|2020-08-05|   1|   4|   4|
|2020-08-06|   0|   1|   0|
|2020-08-07|   0|   2|   0|
|2020-08-08|   0|   3|   0|
|2020-08-09|   0|  -1|   0|
+----------+----+----+----+
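Just to spell the base rule out in code, here is a minimal sketch. It assumes col1 itself counts down to 1 as in the sample, so it can double as the backwards counter; the attempt further below uses a rank instead and does not rely on that.

from pyspark.sql import Window
from pyspark.sql import functions as F

# col2 taken from the single col1 == 1 row, broadcast to every row
anchor = F.max(F.when(F.col("col1") == 1, F.col("col2"))).over(Window.partitionBy())

# rows at or before the col1 == 1 date count up backwards; later rows get 0
df.withColumn(
    "want",
    F.when(F.col("col1") >= 1, anchor + F.col("col1") - 1).otherwise(F.lit(0)),
).orderBy("date").show()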
Enhancement: I want to add an additional condition for when col2 == -1 on the col1 == 1 row and the -1 continues consecutively: count the consecutive -1s, then add that count to the next non -1 col2 value. Here is an example to make it clear:
+----------+----+----+----+
|      date|col1|col2|want|
+----------+----+----+----+
|2020-08-01|   5|  -1|  11|
|2020-08-02|   4|  -1|  10|
|2020-08-03|   3|   3|   9|
|2020-08-04|   2|   2|   8|
|2020-08-05|   1|  -1|   7|
|2020-08-06|   0|  -1|   0|
|2020-08-07|   0|  -1|   0|
|2020-08-08|   0|   4|   0|
|2020-08-09|   0|  -1|   0|
+----------+----+----+----+
So here we see 3 consecutive -1s (we only care about the first run of consecutive -1s), and right after that run col2 is 4, so we would get 4 + 3 = 7 at the col1 == 1 row. Is that possible?
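To make the arithmetic concrete, here is a rough sketch of how the adjusted base value could be derived for the second sample; it is not a solution, just the rule spelled out on the driver. df2 stands for the second sample dataframe, and the variable names are only for illustration.

rows = df2.orderBy("date").select("date", "col1", "col2").collect()

# position of the single col1 == 1 row
i = next(idx for idx, r in enumerate(rows) if r["col1"] == 1)

if rows[i]["col2"] != -1:
    base = rows[i]["col2"]              # original rule: take col2 as-is
else:
    run, j = 0, i
    # length of the first run of -1s starting at the col1 == 1 row
    while j < len(rows) and rows[j]["col2"] == -1:
        run += 1
        j += 1
    base = rows[j]["col2"] + run        # next non -1 value plus the run length

# for the second sample: run == 3, next value == 4, so base == 7,
# and counting backwards from there gives 7, 8, 9, 10, 11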
Here is my try:
from pyspark.sql.functions import sum, when, col, rank, desc
from pyspark.sql import Window

w1 = Window.orderBy(desc('date'))
w2 = Window.partitionBy('case').orderBy(desc('date'))

# 'case' carries the col2 value of the col1 == 1 row back in time (0 for later rows),
# 'rank' counts how many rows each row sits before that anchor row,
# and 'want' adds the two together.
df.withColumn('case', sum(when(col('col1') == 1, col('col2')).otherwise(0)).over(w1)) \
  .withColumn('rank', when(col('case') != 0, rank().over(w2) - 1).otherwise(0)) \
  .withColumn('want', col('case') + col('rank')) \
  .orderBy('date') \
  .show(10, False)
+----------+----+----+----+----+----+
|date      |col1|col2|case|rank|want|
+----------+----+----+----+----+----+
|2020-08-01|5   |-1  |4   |4   |8   |
|2020-08-02|4   |-1  |4   |3   |7   |
|2020-08-03|3   |3   |4   |2   |6   |
|2020-08-04|2   |2   |4   |1   |5   |
|2020-08-05|1   |4   |4   |0   |4   |
|2020-08-06|0   |1   |0   |0   |0   |
|2020-08-07|0   |2   |0   |0   |0   |
|2020-08-08|0   |3   |0   |0   |0   |
|2020-08-09|0   |-1  |0   |0   |0   |
+----------+----+----+----+----+----+