
Pyspark: How to lead from specific column value in Dataframe

The dataframe is already sorted by date, the col1 == 1 value is unique, and only the 0s have duplicates.

I have a dataframe that looks like this; call it df:

+----------+----+----+
|      date|col1|col2|
+----------+----+----+
|2020-08-01|   5|  -1|
|2020-08-02|   4|  -1|
|2020-08-03|   3|   3|
|2020-08-04|   2|   2|
|2020-08-05|   1|   4|
|2020-08-06|   0|   1|
|2020-08-07|   0|   2|
|2020-08-08|   0|   3|
|2020-08-09|   0|  -1|
+----------+----+----+
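For reference, this example dataframe can be built like this (a minimal sketch; the date column is kept as a string, which still sorts correctly in yyyy-MM-dd format):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Rows copied from the table above
df = spark.createDataFrame(
    [('2020-08-01', 5, -1), ('2020-08-02', 4, -1), ('2020-08-03', 3, 3),
     ('2020-08-04', 2, 2), ('2020-08-05', 1, 4), ('2020-08-06', 0, 1),
     ('2020-08-07', 0, 2), ('2020-08-08', 0, 3), ('2020-08-09', 0, -1)],
    ['date', 'col1', 'col2'])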

The condition is: when col1 == 1, we start counting up backwards (towards earlier dates) from that row's col2 value, here 4 (e.g. 4, 5, 6, 7, 8, ...), and after the col1 == 1 row we return 0 all the way (e.g. 4, 0, 0, 0, 0, ...).

So my resulting df will look something like this:

+----------+----+----+----+
|      date|col1|col2|want|
+----------+----+----+----+
|2020-08-01|   5|  -1|   8|
|2020-08-02|   4|  -1|   7|
|2020-08-03|   3|   3|   6|
|2020-08-04|   2|   2|   5|
|2020-08-05|   1|   4|   4|
|2020-08-06|   0|   1|   0|
|2020-08-07|   0|   2|   0|
|2020-08-08|   0|   3|   0|
|2020-08-09|   0|  -1|   0|
+----------+----+----+----+

Enhancement: I want to add an additional condition for when col2 == -1 on the col1 == 1 row and the -1s run consecutively: count the consecutive -1s, then add that count to the next col2 value that is not -1. Here's an example to make it clear:

+----------+----+----+----+
|      date|col1|col2|want|
+----------+----+----+----+
|2020-08-01|   5|  -1|  11|
|2020-08-02|   4|  -1|  10|
|2020-08-03|   3|   3|   9|
|2020-08-04|   2|   2|   8|
|2020-08-05|   1|  -1|   7|
|2020-08-06|   0|  -1|   0|
|2020-08-07|   0|  -1|   0|
|2020-08-08|   0|   4|   0|
|2020-08-09|   0|  -1|   0|
+----------+----+----+----+

So here we see 3 consecutive -1s (we only care about the first run of consecutive -1s), and after the run we have a 4, so we would get 4 + 3 = 7 at the col1 == 1 row. Is it possible?

Here is my try:

from pyspark.sql.functions import sum, when, col, rank, desc
from pyspark.sql import Window

w1 = Window.orderBy(desc('date'))
w2 = Window.partitionBy('case').orderBy(desc('date'))

# 'case' carries the col2 value of the col1 == 1 row to that row and every
# earlier row (running sum in descending date order); later rows stay 0.
# 'rank' then counts how many rows separate each row from the col1 == 1 row
# going back in time, so 'want' = case + rank counts backwards from col2.
df.withColumn('case', sum(when(col('col1') == 1, col('col2')).otherwise(0)).over(w1)) \
  .withColumn('rank', when(col('case') != 0, rank().over(w2) - 1).otherwise(0)) \
  .withColumn('want', col('case') + col('rank')) \
  .orderBy('date') \
  .show(10, False)

+----------+----+----+----+----+----+
|date      |col1|col2|case|rank|want|
+----------+----+----+----+----+----+
|2020-08-01|5   |-1  |4   |4   |8   |
|2020-08-02|4   |-1  |4   |3   |7   |
|2020-08-03|3   |3   |4   |2   |6   |
|2020-08-04|2   |2   |4   |1   |5   |
|2020-08-05|1   |4   |4   |0   |4   |
|2020-08-06|0   |1   |0   |0   |0   |
|2020-08-07|0   |2   |0   |0   |0   |
|2020-08-08|0   |3   |0   |0   |0   |
|2020-08-09|0   |-1  |0   |0   |0   |
+----------+----+----+----+----+----+
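The attempt above reproduces the base case. For the enhancement, here is a sketch of one possible extension (untested; the intermediate column names tail, leading, k, v and steps are my own). The idea: rows from the col1 == 1 row onward form a "tail"; count the leading run of -1s in the tail (k) and take the first col2 value after that run (v); v + k then replaces col2 as the base for counting backwards, and the base case falls out naturally with k = 0.

from pyspark.sql import Window
from pyspark.sql import functions as F

w_asc  = Window.orderBy('date')
w_cum  = w_asc.rowsBetween(Window.unboundedPreceding, 0)
w_full = w_asc.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
w_desc = Window.orderBy(F.desc('date')).rowsBetween(Window.unboundedPreceding, 0)

result = (df
    # tail = 1 from the col1 == 1 row onward (running max in date order)
    .withColumn('tail', F.max((F.col('col1') == 1).cast('int')).over(w_cum))
    # leading = 1 while every tail row seen so far has col2 == -1
    .withColumn('leading',
        ((F.col('tail') == 1) &
         (F.min(F.when(F.col('tail') == 1, (F.col('col2') == -1).cast('int')))
            .over(w_cum) == 1)).cast('int'))
    # k = length of the leading -1 run; v = first col2 after that run
    .withColumn('k', F.sum('leading').over(w_full))
    .withColumn('v', F.first(
        F.when((F.col('tail') == 1) & (F.col('leading') == 0), F.col('col2')),
        ignorenulls=True).over(w_full))
    # count backwards from the base v + k; tail rows except the anchor get 0
    .withColumn('steps', F.sum((F.col('tail') == 0).cast('int')).over(w_desc))
    .withColumn('want', F.when((F.col('tail') == 1) & (F.col('col1') != 1), 0)
                         .otherwise(F.col('v') + F.col('k') + F.col('steps')))
    .drop('tail', 'leading', 'k', 'v', 'steps')
    .orderBy('date'))

result.show()

On the first example this should give k = 0 and v = 4, reducing to the original logic; on the second it should give k = 3 and v = 4, so the col1 == 1 row gets 7. Like the attempt above, everything uses unpartitioned window functions, so Spark will warn about moving all data to a single partition.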
