
How to use pyspark dataframe window function

I have a dataframe like below:

[screenshot of the input dataframe]

I want to get a dataframe that has the most recent version with the latest date. The first filter criterion is the latest version, and then the latest date. The resulting dataframe should look like below:

[screenshot of the expected output]

I am using a window function to achieve this. I have written the piece of code below.

wind = Window.partitionBy("id")
data = data.withColumn("maxVersion", F.max("version").over(wind)) \
               .withColumn("maxDt", F.max("dt").over(wind)) \
               .where(F.col("version") == F.col("maxVersion")) \
               .where(F.col("maxDt") == F.col("dt")) \
               .drop(F.col("maxVersion")) \
               .drop(F.col("maxDt"))

I am not sure where I am missing out. I am only getting one output row, with id 100. Please help me solve this.

As you mentioned, there is an order in your operation: first version, then dt. Basically, you need to select only the maximum version (removing everything else) and then select the maximum dt, removing everything else. You just have to switch two lines, like this:

wind = Window.partitionBy("id")
data = data.withColumn("maxVersion", F.max("version").over(wind)) \
               .where(F.col("version") == F.col("maxVersion")) \
               .withColumn("maxDt", F.max("dt").over(wind)) \
               .where(F.col("maxDt") == F.col("dt")) \
               .drop(F.col("maxVersion")) \
               .drop(F.col("maxDt"))

The reason why you got only one row for id 100 is that, in that case, the maximum version and the maximum dt happen to fall on the same row (you got lucky). But that is not true for id 200.
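To see this concretely, here is a minimal sketch (assuming a local SparkSession, and using only the id 200 rows from the question) showing that when both max columns are computed before any filtering, no single id 200 row satisfies both conditions at once:

from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F

spark = SparkSession.builder.master("local[1]").getOrCreate()

rows = [(200, 1, "2020-09-19", "Jay1"),   # latest dt, but not the latest version
        (200, 2, "2020-07-19", "Jay2"),
        (200, 2, "2020-08-19", "Jay3")]   # latest version, but not the latest dt
df = spark.createDataFrame(rows, ["id", "version", "dt", "Name"])

w = Window.partitionBy("id")
# Both maxima are computed over all three rows, so maxVersion = 2 and
# maxDt = 2020-09-19; no single row matches both, and the result is empty.
df.withColumn("maxVersion", F.max("version").over(w)) \
  .withColumn("maxDt", F.max("dt").over(w)) \
  .where((F.col("version") == F.col("maxVersion")) &
         (F.col("dt") == F.col("maxDt"))) \
  .show()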

Basically, there are a couple of issues with your formulation. First, you need to change the date from a string to a proper date type. Then, Window in pyspark allows you to specify the ordering of the columns one after the other. Then there is the rank() function, which allows you to rank the results over the Window. Finally, all that remains is to select the first rank.

from pyspark.sql.types import *
from pyspark import SparkContext, SQLContext
import pyspark.sql.functions as F
from pyspark.sql import Window

sc = SparkContext('local')
sqlContext = SQLContext(sc)

data1 = [
        (100,1,"2020-03-19","Nil1"),
        (100,2,"2020-04-19","Nil2"),
        (100,2,"2020-04-19","Nil2"),
        (100,2,"2020-05-19","Ni13"),
        (200,1,"2020-09-19","Jay1"),
        (200,2,"2020-07-19","Jay2"),
        (200,2,"2020-08-19","Jay3"),

      ]

df1Columns = ["id", "version", "dt",  "Name"]
df1 = sqlContext.createDataFrame(data=data1, schema = df1Columns)
df1 = df1.withColumn("dt",F.to_date(F.to_timestamp("dt", 'yyyy-MM-dd')).alias('dt'))
print("Schema.")
df1.printSchema()
print("Actual initial data")
df1.show(truncate=False)

wind = Window.partitionBy("id").orderBy(F.desc("version"), F.desc("dt"))

df1 = df1.withColumn("rank", F.rank().over(wind))
print("Ranking over the window spec specified")
df1.show(truncate=False)

final_df = df1.filter(F.col("rank") == 1).drop("rank")
print("Filtering the final result by applying the rank == 1 condition")
final_df.show(truncate=False)

Output:

Schema.
root
 |-- id: long (nullable = true)
 |-- version: long (nullable = true)
 |-- dt: date (nullable = true)
 |-- Name: string (nullable = true)

Actual initial data
+---+-------+----------+----+
|id |version|dt        |Name|
+---+-------+----------+----+
|100|1      |2020-03-19|Nil1|
|100|2      |2020-04-19|Nil2|
|100|2      |2020-04-19|Nil2|
|100|2      |2020-05-19|Ni13|
|200|1      |2020-09-19|Jay1|
|200|2      |2020-07-19|Jay2|
|200|2      |2020-08-19|Jay3|
+---+-------+----------+----+

Ranking over the window spec specified
+---+-------+----------+----+----+
|id |version|dt        |Name|rank|
+---+-------+----------+----+----+
|100|2      |2020-05-19|Ni13|1   |
|100|2      |2020-04-19|Nil2|2   |
|100|2      |2020-04-19|Nil2|2   |
|100|1      |2020-03-19|Nil1|4   |
|200|2      |2020-08-19|Jay3|1   |
|200|2      |2020-07-19|Jay2|2   |
|200|1      |2020-09-19|Jay1|3   |
+---+-------+----------+----+----+

Filtering the final result by applying the rank == 1 condition
+---+-------+----------+----+
|id |version|dt        |Name|
+---+-------+----------+----+
|100|2      |2020-05-19|Ni13|
|200|2      |2020-08-19|Jay3|
+---+-------+----------+----+

A neater way is perhaps to do the following:

w = Window.partitionBy("id").orderBy(F.col('version').desc(), F.col('dt').desc())
df1.withColumn('maximum', F.row_number().over(w)).filter('maximum = 1').drop('maximum').show()
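
Note that F.row_number() assigns a unique sequential number within each partition, so this variant always returns exactly one row per id even if the top (version, dt) combination is duplicated, whereas F.rank() in the answer above would keep every tied row.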
