PySpark: How to lead from a specific column value in a DataFrame
The dataframe is already sorted by date. The col1 == 1 value is unique, and only the 0 values have duplicates.
I have a dataframe that looks like this; call it df:
+----------+----+----+
|      date|col1|col2|
+----------+----+----+
|2020-08-01|   5|  -1|
|2020-08-02|   4|  -1|
|2020-08-03|   3|   3|
|2020-08-04|   2|   2|
|2020-08-05|   1|   4|
|2020-08-06|   0|   1|
|2020-08-07|   0|   2|
|2020-08-08|   0|   3|
|2020-08-09|   0|  -1|
+----------+----+----+
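For reference, here is the sample data above as a reproducible dataframe (a minimal sketch; dates are kept as plain strings for brevity):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# sample data from the table above
df = spark.createDataFrame(
    [("2020-08-01", 5, -1), ("2020-08-02", 4, -1), ("2020-08-03", 3, 3),
     ("2020-08-04", 2, 2), ("2020-08-05", 1, 4), ("2020-08-06", 0, 1),
     ("2020-08-07", 0, 2), ("2020-08-08", 0, 3), ("2020-08-09", 0, -1)],
    ["date", "col1", "col2"],
)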
The condition is: when col1 == 1, we take that row's col2 value (here col2 == 4) and count up from it going backwards in time (e.g. 4, 5, 6, 7, 8, ...), and every row after the col1 == 1 row gets 0 (e.g. 4, 0, 0, 0, 0, ...).
So the resulting df would look something like this:
+----------+----+----+----+
|      date|col1|col2|want|
+----------+----+----+----+
|2020-08-01|   5|  -1|   8|
|2020-08-02|   4|  -1|   7|
|2020-08-03|   3|   3|   6|
|2020-08-04|   2|   2|   5|
|2020-08-05|   1|   4|   4|
|2020-08-06|   0|   1|   0|
|2020-08-07|   0|   2|   0|
|2020-08-08|   0|   3|   0|
|2020-08-09|   0|  -1|   0|
+----------+----+----+----+
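Just to spell the base rule out in code, here is a minimal sketch. It assumes col1 itself counts down to 1 as in the sample, so it can double as the backwards counter; the attempt further below uses a rank instead and does not rely on that.

from pyspark.sql import Window
from pyspark.sql import functions as F

# col2 taken from the single col1 == 1 row, broadcast to every row
anchor = F.max(F.when(F.col("col1") == 1, F.col("col2"))).over(Window.partitionBy())

# rows at or before the col1 == 1 date count up backwards; later rows get 0
df.withColumn(
    "want",
    F.when(F.col("col1") >= 1, anchor + F.col("col1") - 1).otherwise(F.lit(0)),
).orderBy("date").show()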
Enhancement: I want to add an additional condition for when col2 == -1 on the col1 == 1 row and the -1 continues consecutively: count the consecutive -1s, then add that count to the next non -1 col2 value. Here is an example to make it clear:
+----------+----+----+----+
|      date|col1|col2|want|
+----------+----+----+----+
|2020-08-01|   5|  -1|  11|
|2020-08-02|   4|  -1|  10|
|2020-08-03|   3|   3|   9|
|2020-08-04|   2|   2|   8|
|2020-08-05|   1|  -1|   7|
|2020-08-06|   0|  -1|   0|
|2020-08-07|   0|  -1|   0|
|2020-08-08|   0|   4|   0|
|2020-08-09|   0|  -1|   0|
+----------+----+----+----+
So here we see 3 consecutive -1s (we only care about the first run of consecutive -1s), and right after that run col2 is 4, so we would get 4 + 3 = 7 at the col1 == 1 row. Is that possible?
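To make the arithmetic concrete, here is a rough sketch of how the adjusted base value could be derived for the second sample; it is not a solution, just the rule spelled out on the driver. df2 stands for the second sample dataframe, and the variable names are only for illustration.

rows = df2.orderBy("date").select("date", "col1", "col2").collect()

# position of the single col1 == 1 row
i = next(idx for idx, r in enumerate(rows) if r["col1"] == 1)

if rows[i]["col2"] != -1:
    base = rows[i]["col2"]              # original rule: take col2 as-is
else:
    run, j = 0, i
    # length of the first run of -1s starting at the col1 == 1 row
    while j < len(rows) and rows[j]["col2"] == -1:
        run += 1
        j += 1
    base = rows[j]["col2"] + run        # next non -1 value plus the run length

# for the second sample: run == 3, next value == 4, so base == 7,
# and counting backwards from there gives 7, 8, 9, 10, 11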
Here is my try:
from pyspark.sql.functions import sum, when, col, rank, desc
from pyspark.sql import Window

w1 = Window.orderBy(desc('date'))
w2 = Window.partitionBy('case').orderBy(desc('date'))

# 'case' carries the col2 value of the col1 == 1 row back in time (0 for later rows),
# 'rank' counts how many rows each row sits before that anchor row,
# and 'want' adds the two together.
df.withColumn('case', sum(when(col('col1') == 1, col('col2')).otherwise(0)).over(w1)) \
  .withColumn('rank', when(col('case') != 0, rank().over(w2) - 1).otherwise(0)) \
  .withColumn('want', col('case') + col('rank')) \
  .orderBy('date') \
  .show(10, False)
+----------+----+----+----+----+----+
|date      |col1|col2|case|rank|want|
+----------+----+----+----+----+----+
|2020-08-01|5   |-1  |4   |4   |8   |
|2020-08-02|4   |-1  |4   |3   |7   |
|2020-08-03|3   |3   |4   |2   |6   |
|2020-08-04|2   |2   |4   |1   |5   |
|2020-08-05|1   |4   |4   |0   |4   |
|2020-08-06|0   |1   |0   |0   |0   |
|2020-08-07|0   |2   |0   |0   |0   |
|2020-08-08|0   |3   |0   |0   |0   |
|2020-08-09|0   |-1  |0   |0   |0   |
+----------+----+----+----+----+----+