
PySpark: select the row where a column equals a value in the current row

I have a data frame that has a current date value. I create a new column in the data frame that holds the date one month earlier, like so:

 spark_df = spark_df.withColumn("oneMonthAgo", expr("add_months(calendarday, -1)"))

I would like to go back and find the row whose AsofDate matches that oneMonthAgo column, and include its value as a new column called 1MonthAgoValue. Originally I tried to use window functions to go back, but because no oneMonthAgo value matched any date, it returned the row's own value in the 1MonthAgoValue column.

 Beginning Data Frame:
 +----------+-----------+-----------+
 |  AsofDate|oneMonthAgo|      value|
 +----------+-----------+-----------+
 |2019-02-23| 2019-02-20|          2|
 |2019-03-20| 2019-02-20|          7|
 |2019-03-21| 2019-02-21|         12|
 |2019-03-22| 2019-02-22|         27|
 |2019-03-23| 2019-02-23|         91|
 +----------+-----------+-----------+


 Data Frame to end up with:
 +----------+-----------+-----------+-----------------------+
 |  AsofDate|oneMonthAgo|      value|         1MonthAgoValue|
 +----------+-----------+-----------+-----------------------+
 |2019-02-23| 2019-02-20|          2|                  null |
 |2019-03-20| 2019-02-20|          7|                  null |
 |2019-03-21| 2019-02-21|         12|                  null |
 |2019-03-22| 2019-02-22|         27|                  null |
 |2019-03-23| 2019-02-23|         91|                     2 |
 +----------+-----------+-----------+-----------------------+

In the last row, oneMonthAgo matches the first row's AsofDate, so the first row's value (2) is included in the current row as 1MonthAgoValue.
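The desired result amounts to a keyed lookup: index every value by its AsofDate, then fetch the entry stored under each row's oneMonthAgo. The plain-Python sketch below models that logic on the five sample rows; it is not PySpark, the variable names are made up for illustration, and it assumes AsofDate is unique.

```python
from datetime import date

# Rows copied from the "Beginning Data Frame" table: (AsofDate, oneMonthAgo, value)
rows = [
    (date(2019, 2, 23), date(2019, 2, 20), 2),
    (date(2019, 3, 20), date(2019, 2, 20), 7),
    (date(2019, 3, 21), date(2019, 2, 21), 12),
    (date(2019, 3, 22), date(2019, 2, 22), 27),
    (date(2019, 3, 23), date(2019, 2, 23), 91),
]

# Index every value by its AsofDate (assumption: AsofDate is unique).
value_by_date = {asof: value for asof, _, value in rows}

# For each row, fetch the value stored under its oneMonthAgo date, or None.
one_month_ago_values = [value_by_date.get(prev) for _, prev, _ in rows]
print(one_month_ago_values)  # [None, None, None, None, 2]
```

With several accounts in the data, the lookup key would need to include the account as well as the date.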



 +---------------+-----------------------+-----------+--------------+-----+-------+----+------------+--------------+------------------+---------------+---------------+-----------------+---------------+-------------------+-------------------------+---------------+
  |accountname|             calendarday|     market|returnposition| year| month| day|yearUnique|last_monday|firstDayOfMonth|oneMonthAgo|twoMonthAgo| threeMonthAgo| sixMonthAgo|twelveMonthAgo|                       indexCP|inceptionDate| 
 +--------------+-----------------------+-----------+--------------+-----+-------+----+------------+--------------+------------------+---------------+----------------+-----------------+---------------+-------------------+-------------------------+---------------+
  |          Giants|2015-01-02 00:00:00|           null|               null| 2015|       01|   02|2015-01-02| 2014-12-29|        2015-01-01|   2014-12-02|      2014-11-02|      2014-10-02|    2014-07-02|         2014-01-02|                           100.0|    2015-01-02|
  |          Giants|2015-01-05 00:00:00|110086.25|         0.0105| 2015|       01|   05|2015-01-05| 2015-01-05|        2015-01-01|   2014-12-05|      2014-11-05|      2014-10-05|    2014-07-05|         2014-01-05|                         101.05|    2015-01-02|
  |          Giants|2015-01-06 00:00:00|  201251.5|         2.0E-4| 2015|       01|   06|2015-01-06| 2015-01-05|        2015-01-01|   2014-12-06|      2014-11-06|      2014-10-06|    2014-07-06|         2014-01-06|  101.07020999999999|   2015-01-02|
  |          Giants|2015-01-07 00:00:00|  216786.5|         -0.006| 2015|       01|   07|2015-01-07| 2015-01-05|        2015-01-01|   2014-12-07|      2014-11-07|      2014-10-07|    2014-07-07|         2014-01-07|              100.46378874|    2015-01-02|
  |          Giants|2015-01-08 00:00:00|  215464.5|       -0.0063| 2015|       01|   08|2015-01-08| 2015-01-05|        2015-01-01|   2014-12-08|      2014-11-08|      2014-10-08|    2014-07-08|         2014-01-08|        99.830866870938|    2015-01-02|
  |          Giants|2015-01-09 00:00:00|214103.25|        0.0052| 2015|       01|   09|2015-01-09| 2015-01-05|        2015-01-01|   2014-12-09|      2014-11-09|      2014-10-09|    2014-07-09|         2014-01-09|  100.34998737866687|    2015-01-02|
  |          Giants|2015-01-12 00:00:00|  215218.5|       -4.0E-4| 2015|       01|   12|2015-01-12| 2015-01-12|        2015-01-01|   2014-12-12|      2014-11-12|      2014-10-12|    2014-07-12|         2014-01-12|  100.30984738371541|    2015-01-02|
  |          Giants|2015-01-13 00:00:00|215125.25|        0.0036| 2015|       01|   13|2015-01-13| 2015-01-12|        2015-01-01|   2014-12-13|      2014-11-13|      2014-10-13|    2014-07-13|         2014-01-13|  100.67096283429677|    2015-01-02|
  |          Giants|2015-01-14 00:00:00|  215919.5|        8.0E-4| 2015|       01|   14|2015-01-14| 2015-01-12|        2015-01-01|   2014-12-14|      2014-11-14|      2014-10-14|    2014-07-14|         2014-01-14|    100.7514996045642|    2015-01-02|
  |          Giants|2015-01-15 00:00:00|216103.75|        4.0E-4| 2015|       01|   15|2015-01-15| 2015-01-12|        2015-01-01|   2014-12-15|      2014-11-15|      2014-10-15|    2014-07-15|         2014-01-15|  100.79180020440602|    2015-01-02|
  |          Giants|2015-01-16 00:00:00|  216205.5|       0.0052| 2015|       01|   16|2015-01-16| 2015-01-12|        2015-01-01|   2014-12-16|      2014-11-16|      2014-10-16|    2014-07-16|         2014-01-16|  101.31591756546894|    2015-01-02|
  |          Giants|2015-01-19 00:00:00|  347334.0|      -0.0045| 2015|       01|   19|2015-01-19| 2015-01-19|        2015-01-01|   2014-12-19|      2014-11-19|      2014-10-19|    2014-07-19|         2014-01-19|  100.85999593642434|    2015-01-02|
  |          Giants|2015-01-20 00:00:00|  345767.0|       0.0015| 2015|       01|   20|2015-01-20| 2015-01-19|        2015-01-01|   2014-12-20|      2014-11-20|      2014-10-20|    2014-07-20|         2014-01-20|  101.01128593032898|    2015-01-02|
  |          Giants|2015-01-21 00:00:00|  346314.5|       2.0E-4| 2015|       01|   21|2015-01-21| 2015-01-19|        2015-01-01|   2014-12-21|      2014-11-21|      2014-10-21|    2014-07-21|         2014-01-21|  101.03148818751504|    2015-01-02|
  |          Giants|2015-01-22 00:00:00|346399.75|       0.0029| 2015|       01|   22|2015-01-22| 2015-01-19|        2015-01-01|   2014-12-22|      2014-11-22|      2014-10-22|    2014-07-22|         2014-01-22| 101.32447950325883|    2015-01-02|
  |          Giants|2015-01-23 00:00:00|347412.75|      -6.0E-4| 2015|       01|   23|2015-01-23| 2015-01-19|        2015-01-01|   2014-12-23|      2014-11-23|      2014-10-23|    2014-07-23|         2014-01-23| 101.26368481555686|    2015-01-02|
  |          Giants|2015-01-26 00:00:00|348303.75|      -6.0E-4| 2015|       01|   26|2015-01-26| 2015-01-26|        2015-01-01|   2014-12-26|      2014-11-26|      2014-10-26|    2014-07-26|         2014-01-26| 101.20292660466752|    2015-01-02|
  |          Giants|2015-01-27 00:00:00|  348541.0|      -0.0044| 2015|       01|   27|2015-01-27| 2015-01-26|        2015-01-01|   2014-12-27|      2014-11-27|      2014-10-27|    2014-07-27|         2014-01-27| 100.75763372760697|    2015-01-02|
  |          Giants|2015-01-28 00:00:00|347579.25|       0.0015| 2015|       01|   28|2015-01-28| 2015-01-26|        2015-01-01|   2014-12-28|      2014-11-28|      2014-10-28|    2014-07-28|         2014-01-28|   100.9087701781984|    2015-01-02|
  |          Giants|2015-01-29 00:00:00|348431.75|       2.0E-4| 2015|       01|   29|2015-01-29| 2015-01-26|        2015-01-01|   2014-12-29|      2014-11-29|      2014-10-29|    2014-07-29|         2014-01-29|  100.92895193223403|    2015-01-02|
  |        Yankees|2015-01-02 00:00:00|           null|               null| 2015|       01|   02|2015-01-02| 2014-12-29|        2015-01-01|   2014-12-02|      2014-11-02|      2014-10-02|    2014-07-02|         2014-01-02|                           100.0|    2015-01-02|
  |        Yankees|2015-01-05 00:00:00|110086.25|         0.0105| 2015|       01|   05|2015-01-05| 2015-01-05|        2015-01-01|   2014-12-05|      2014-11-05|      2014-10-05|    2014-07-05|         2014-01-05|                         101.05|    2015-01-02|
  |        Yankees|2015-01-06 00:00:00|  201251.5|         2.0E-4| 2015|       01|   06|2015-01-06| 2015-01-05|        2015-01-01|   2014-12-06|      2014-11-06|      2014-10-06|    2014-07-06|         2014-01-06|  101.07020999999999|   2015-01-02|
  |        Yankees|2015-01-07 00:00:00|  216786.5|         -0.006| 2015|       01|   07|2015-01-07| 2015-01-05|        2015-01-01|   2014-12-07|      2014-11-07|      2014-10-07|    2014-07-07|         2014-01-07|              100.46378874|    2015-01-02|
  |        Yankees|2015-01-08 00:00:00|  215464.5|       -0.0063| 2015|       01|   08|2015-01-08| 2015-01-05|        2015-01-01|   2014-12-08|      2014-11-08|      2014-10-08|    2014-07-08|         2014-01-08|        99.830866870938|    2015-01-02|
  |        Yankees|2015-01-09 00:00:00|214103.25|        0.0052| 2015|       01|   09|2015-01-09| 2015-01-05|        2015-01-01|   2014-12-09|      2014-11-09|      2014-10-09|    2014-07-09|         2014-01-09|  100.34998737866687|    2015-01-02|
  |        Yankees|2015-01-12 00:00:00|  215218.5|       -4.0E-4| 2015|       01|   12|2015-01-12| 2015-01-12|        2015-01-01|   2014-12-12|      2014-11-12|      2014-10-12|    2014-07-12|         2014-01-12|  100.30984738371541|    2015-01-02|
  |        Yankees|2015-01-13 00:00:00|215125.25|        0.0036| 2015|       01|   13|2015-01-13| 2015-01-12|        2015-01-01|   2014-12-13|      2014-11-13|      2014-10-13|    2014-07-13|         2014-01-13|  100.67096283429677|    2015-01-02|
  |        Yankees|2015-01-14 00:00:00|  215919.5|        8.0E-4| 2015|       01|   14|2015-01-14| 2015-01-12|        2015-01-01|   2014-12-14|      2014-11-14|      2014-10-14|    2014-07-14|         2014-01-14|    100.7514996045642|    2015-01-02|
  |        Yankees|2015-01-15 00:00:00|216103.75|        4.0E-4| 2015|       01|   15|2015-01-15| 2015-01-12|        2015-01-01|   2014-12-15|      2014-11-15|      2014-10-15|    2014-07-15|         2014-01-15|  100.79180020440602|    2015-01-02|
  |        Yankees|2015-01-16 00:00:00|  216205.5|       0.0052| 2015|       01|   16|2015-01-16| 2015-01-12|        2015-01-01|   2014-12-16|      2014-11-16|      2014-10-16|    2014-07-16|         2014-01-16|  101.31591756546894|    2015-01-02|
  |        Yankees|2015-01-19 00:00:00|  347334.0|      -0.0045| 2015|       01|   19|2015-01-19| 2015-01-19|        2015-01-01|   2014-12-19|      2014-11-19|      2014-10-19|    2014-07-19|         2014-01-19|  100.85999593642434|    2015-01-02|
  |        Yankees|2015-01-20 00:00:00|  345767.0|       0.0015| 2015|       01|   20|2015-01-20| 2015-01-19|        2015-01-01|   2014-12-20|      2014-11-20|      2014-10-20|    2014-07-20|         2014-01-20|  101.01128593032898|    2015-01-02|
  |        Yankees|2015-01-21 00:00:00|  346314.5|       2.0E-4| 2015|       01|   21|2015-01-21| 2015-01-19|        2015-01-01|   2014-12-21|      2014-11-21|      2014-10-21|    2014-07-21|         2014-01-21|  101.03148818751504|    2015-01-02|
  |        Yankees|2015-01-22 00:00:00|346399.75|       0.0029| 2015|       01|   22|2015-01-22| 2015-01-19|        2015-01-01|   2014-12-22|      2014-11-22|      2014-10-22|    2014-07-22|         2014-01-22| 101.32447950325883|    2015-01-02|
  |        Yankees|2015-01-23 00:00:00|347412.75|      -6.0E-4| 2015|       01|   23|2015-01-23| 2015-01-19|        2015-01-01|   2014-12-23|      2014-11-23|      2014-10-23|    2014-07-23|         2014-01-23| 101.26368481555686|    2015-01-02|
  |        Yankees|2015-01-26 00:00:00|348303.75|      -6.0E-4| 2015|       01|   26|2015-01-26| 2015-01-26|        2015-01-01|   2014-12-26|      2014-11-26|      2014-10-26|    2014-07-26|         2014-01-26| 101.20292660466752|    2015-01-02|
  |        Yankees|2015-01-27 00:00:00|  348541.0|      -0.0044| 2015|       01|   27|2015-01-27| 2015-01-26|        2015-01-01|   2014-12-27|      2014-11-27|      2014-10-27|    2014-07-27|         2014-01-27| 100.75763372760697|    2015-01-02|
  |        Yankees|2015-01-28 00:00:00|347579.25|       0.0015| 2015|       01|   28|2015-01-28| 2015-01-26|        2015-01-01|   2014-12-28|      2014-11-28|      2014-10-28|    2014-07-28|         2014-01-28|   100.9087701781984|    2015-01-02|
  |        Yankees|2015-01-29 00:00:00|348431.75|       2.0E-4| 2015|       01|   29|2015-01-29| 2015-01-26|        2015-01-01|   2014-12-29|      2014-11-29|      2014-10-29|    2014-07-29|         2014-01-29|  100.92895193223403|    2015-01-02|
  |        Yankees|2015-02-28 00:00:00|348431.75|       2.0E-4| 2015|       02|   28|2015-02-28| 2015-01-26|        2015-01-01|   2014-12-28|      2014-11-28|      2014-10-28|    2014-07-28|         2014-01-28|  100.92895193223403|    2015-01-02|
 +--------------+-----------------------+-----------+--------------+-----+-------+----+------------+--------------+------------------+---------------+----------------+-----------------+---------------+-------------------+-------------------------+---------------+
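The oneMonthAgo column in this sample comes from add_months(calendarday, -1). One detail matters when matching on exact dates: when the target month is shorter, the day is clamped to the end of that month, so several dates can map to the same oneMonthAgo. The function below is a plain-Python approximation of that behaviour, written for illustration only. It is not Spark's implementation, and Spark versions may differ in how they treat a start date that is itself the last day of a month.

```python
import calendar
from datetime import date

def add_months(d, months):
    """Shift d by a number of months, clamping the day to the target month's length."""
    # Count months from year 0 so that negative shifts roll over the year correctly.
    index = d.year * 12 + (d.month - 1) + months
    year, month0 = divmod(index, 12)
    month = month0 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))

print(add_months(date(2015, 1, 2), -1))   # 2014-12-02, as in the first sample row
print(add_months(date(2019, 3, 30), -1))  # 2019-02-28
print(add_months(date(2019, 3, 31), -1))  # 2019-02-28
```

Because 2019-03-30 and 2019-03-31 both map to 2019-02-28 here, a lookup on oneMonthAgo would give both rows the same earlier value.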

This will work. A window function was the right way to go about it, but you don't need the oneMonthAgo column for this (the code below uses only AsofDate and value):

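from pyspark.sql import functions as F
from pyspark.sql import Window

# Partition by day of month, then order each partition chronologically by year and month.
w = Window.partitionBy(F.dayofmonth('AsofDate'))
w = w.orderBy(F.year('AsofDate'), F.month('AsofDate'))

# lag() reads 'value' from the previous row in the same partition (null for the first row).
df = df.withColumn('1MonthAgoValue', F.lag('value').over(w))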

In a real dataset you will probably want to do this per id. If that's the case, add the id column to the partitionBy as well.

PS: if lag() gets you the wrong result, use lead(). lag() reads the previous row in the window ordering and lead() reads the next one; I always forget the order and end up having to try both.
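One thing to check before relying on the window approach: lag() returns the previous row in the partition, whatever its date is. If a month is missing for a given day of month, the value comes from further back than one month. The plain-Python sketch below models the partitioning and the lag to show this; the helper name and the gap data (value 5 on 2019-01-23) are hypothetical.

```python
from collections import defaultdict
from datetime import date

def lag_by_day_of_month(rows):
    """rows: (AsofDate, value) pairs. Returns {AsofDate: previous value in its partition}."""
    # Partition by day of month, like Window.partitionBy(F.dayofmonth('AsofDate')).
    partitions = defaultdict(list)
    for asof, value in rows:
        partitions[asof.day].append((asof, value))

    lagged = {}
    for part in partitions.values():
        # Order by year, then month, like the orderBy in the window above.
        part.sort(key=lambda row: (row[0].year, row[0].month))
        previous = None  # the first row of a partition has nothing to lag to
        for asof, value in part:
            lagged[asof] = previous
            previous = value
    return lagged

sample = [
    (date(2019, 2, 23), 2),
    (date(2019, 3, 20), 7),
    (date(2019, 3, 21), 12),
    (date(2019, 3, 22), 27),
    (date(2019, 3, 23), 91),
]
print(lag_by_day_of_month(sample)[date(2019, 3, 23)])  # 2

# Hypothetical data with February missing: the lag reaches back two months.
gap = [(date(2019, 1, 23), 5), (date(2019, 3, 23), 91)]
print(lag_by_day_of_month(gap)[date(2019, 3, 23)])  # 5
```

If every calendar day is present for every id, the two cases coincide; otherwise an exact-date join avoids the problem.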

You could try a left_outer self join like this:

from pyspark.sql import functions as F
df.join(df.select(F.col("AsofDate").alias("oneMonthAgo"),\
                  F.col("value").alias("1MonthAgoValue")),['oneMonthAgo'],'left_outer')\
  .orderBy("AsofDate")\
  .show()

#+-----------+----------+-----+--------------+
#|oneMonthAgo|  AsofDate|value|1MonthAgoValue|
#+-----------+----------+-----+--------------+
#| 2019-02-20|2019-02-23|    2|          null|
#| 2019-02-20|2019-03-20|    7|          null|
#| 2019-02-21|2019-03-21|   12|          null|
#| 2019-02-22|2019-03-22|   27|          null|
#| 2019-02-23|2019-03-23|   91|             2|
#+-----------+----------+-----+--------------+

UPDATE:

Try this:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w=Window().partitionBy(F.dayofmonth("AsofDate"))\
          .orderBy(F.to_timestamp("AsofDate").cast("long"))\
          .rangeBetween(86400*-30,0)

first=F.first("value").over(w)

df.withColumn("1MonthAgoValue", F.when(first!=F.col("value"), first)\
                                 .otherwise(F.lit(None))).show()
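A note on the 30-day range used above, as a quick plain-Python check rather than PySpark: the gap between the same day of month in two consecutive months is between 28 and 31 days, so a fixed 30-day look-back covers February to March but not January to February. The constant and helper names below are made up for illustration.

```python
from datetime import date

WINDOW_DAYS = 30  # mirrors rangeBetween(86400 * -30, 0), expressed in days

def in_window(current, earlier):
    """True if 'earlier' falls within the look-back range of 'current'."""
    gap = (current - earlier).days
    return 0 <= gap <= WINDOW_DAYS

print((date(2019, 3, 23) - date(2019, 2, 23)).days)   # 28
print(in_window(date(2019, 3, 23), date(2019, 2, 23)))  # True
print((date(2019, 2, 23) - date(2019, 1, 23)).days)   # 31
print(in_window(date(2019, 2, 23), date(2019, 1, 23)))  # False
```

Widening the range to 31 days would cover every month. Separately, the first != value test returns null whenever the earlier value happens to equal the current one, so it is worth checking that case against your data.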
