
How to use pyspark dataframe window function

I have a dataframe like below:

[screenshot of the input dataframe]

I want to get a dataframe that has the most recent version with the latest date. The first filter criterion is the latest version, and then the latest date. The resulting dataframe should look like below:

[screenshot of the expected output]

I am using a window function to achieve this. I have written the piece of code below.

wind = Window.partitionBy("id")
data = data.withColumn("maxVersion", F.max("version").over(wind)) \
               .withColumn("maxDt", F.max("dt").over(wind)) \
               .where(F.col("version") == F.col("maxVersion")) \
               .where(F.col("maxDt") == F.col("dt")) \
               .drop(F.col("maxVersion")) \
               .drop(F.col("maxDt"))

I am not sure where I am missing out. I am only getting one output row, with id 100. Please help me solve this.

As you mentioned, there is an order in your operation: first version, then dt. Basically, you need to select only the maximum version (removing everything else) and then select the maximum dt, removing everything else. You just have to switch two lines, like this:

wind = Window.partitionBy("id")
data = data.withColumn("maxVersion", F.max("version").over(wind)) \
               .where(F.col("version") == F.col("maxVersion")) \
               .withColumn("maxDt", F.max("dt").over(wind)) \
               .where(F.col("maxDt") == F.col("dt")) \
               .drop(F.col("maxVersion")) \
               .drop(F.col("maxDt"))

The reason why you got only one row for id 100 is that, in that case, the maximum version and the maximum dt happen to fall on the same row (you got lucky). But that is not true for id 200.
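To see this concretely, here is a minimal sketch (assuming a local SparkSession, and using only the id 200 rows from the question) showing that when both max columns are computed before any filtering, no single id 200 row satisfies both conditions at once:

from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F

spark = SparkSession.builder.master("local[1]").getOrCreate()

rows = [(200, 1, "2020-09-19", "Jay1"),   # latest dt, but not the latest version
        (200, 2, "2020-07-19", "Jay2"),
        (200, 2, "2020-08-19", "Jay3")]   # latest version, but not the latest dt
df = spark.createDataFrame(rows, ["id", "version", "dt", "Name"])

w = Window.partitionBy("id")
# Both maxima are computed over all three rows, so maxVersion = 2 and
# maxDt = 2020-09-19; no single row matches both, and the result is empty.
df.withColumn("maxVersion", F.max("version").over(w)) \
  .withColumn("maxDt", F.max("dt").over(w)) \
  .where((F.col("version") == F.col("maxVersion")) &
         (F.col("dt") == F.col("maxDt"))) \
  .show()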

Basically, there are a couple of issues with your formulation. First, you need to change the date from a string to a proper date type. Then, Window in pyspark allows you to specify the ordering of the columns one after the other. Then there is the rank() function, which allows you to rank the results over the Window. Finally, all that remains is to select the first rank.

from pyspark.sql.types import *
from pyspark import SparkContext, SQLContext
import pyspark.sql.functions as F
from pyspark.sql import Window

sc = SparkContext('local')
sqlContext = SQLContext(sc)

data1 = [
        (100,1,"2020-03-19","Nil1"),
        (100,2,"2020-04-19","Nil2"),
        (100,2,"2020-04-19","Nil2"),
        (100,2,"2020-05-19","Ni13"),
        (200,1,"2020-09-19","Jay1"),
        (200,2,"2020-07-19","Jay2"),
        (200,2,"2020-08-19","Jay3"),

      ]

df1Columns = ["id", "version", "dt",  "Name"]
df1 = sqlContext.createDataFrame(data=data1, schema = df1Columns)
df1 = df1.withColumn("dt",F.to_date(F.to_timestamp("dt", 'yyyy-MM-dd')).alias('dt'))
print("Schema.")
df1.printSchema()
print("Actual initial data")
df1.show(truncate=False)

wind = Window.partitionBy("id").orderBy(F.desc("version"), F.desc("dt"))

df1 = df1.withColumn("rank", F.rank().over(wind))
print("Ranking over the window spec specified")
df1.show(truncate=False)

final_df = df1.filter(F.col("rank") == 1).drop("rank")
print("Filtering the final result by applying the rank == 1 condition")
final_df.show(truncate=False)

Output:

Schema.
root
 |-- id: long (nullable = true)
 |-- version: long (nullable = true)
 |-- dt: date (nullable = true)
 |-- Name: string (nullable = true)

Actual initial data
+---+-------+----------+----+
|id |version|dt        |Name|
+---+-------+----------+----+
|100|1      |2020-03-19|Nil1|
|100|2      |2020-04-19|Nil2|
|100|2      |2020-04-19|Nil2|
|100|2      |2020-05-19|Ni13|
|200|1      |2020-09-19|Jay1|
|200|2      |2020-07-19|Jay2|
|200|2      |2020-08-19|Jay3|
+---+-------+----------+----+

Ranking over the window spec specified
+---+-------+----------+----+----+
|id |version|dt        |Name|rank|
+---+-------+----------+----+----+
|100|2      |2020-05-19|Ni13|1   |
|100|2      |2020-04-19|Nil2|2   |
|100|2      |2020-04-19|Nil2|2   |
|100|1      |2020-03-19|Nil1|4   |
|200|2      |2020-08-19|Jay3|1   |
|200|2      |2020-07-19|Jay2|2   |
|200|1      |2020-09-19|Jay1|3   |
+---+-------+----------+----+----+

Filtering the final result by applying the rank == 1 condition
+---+-------+----------+----+
|id |version|dt        |Name|
+---+-------+----------+----+
|100|2      |2020-05-19|Ni13|
|200|2      |2020-08-19|Jay3|
+---+-------+----------+----+

A neater way is perhaps to do the following:

w = Window.partitionBy("id").orderBy(F.col('version').desc(), F.col('dt').desc())
df1.withColumn('maximum', F.row_number().over(w)).filter('maximum = 1').drop('maximum').show()
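
Note that F.row_number() assigns a unique sequential number within each partition, so this variant always returns exactly one row per id even if the top (version, dt) combination is duplicated, whereas F.rank() in the answer above would keep every tied row.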
