
Join pyspark based on the most recent record

I need to join these dataframes:

df0:
+---+--------+
| id|quantity|
+---+--------+
|  a|       4|
|  b|       7|
|  c|       6|
|  d|       1|
+---+--------+
df1:
+---+--------+----------+
| id|order_id|order_date|
+---+--------+----------+
|  a|       x|2021-01-25|
|  a|       y|2021-01-23|
|  b|       z|2021-01-28|
|  b|       x|2021-01-20|
|  c|       y|2021-01-15|
|  d|       x|2021-01-18|
+---+--------+----------+

The result I want to get is as follows:

+---+--------+--------+----------+
| id|quantity|order_id|order_date|
+---+--------+--------+----------+
|  a|       4|       x|2021-01-25|
|  b|       7|       z|2021-01-28|
|  c|       6|       y|2021-01-15|
|  d|       1|       x|2021-01-18|
+---+--------+--------+----------+

That is, I only need to join with the most recent record based on order_date.

Simply group df1 by id, aggregate the max order_date, and then join the result with df0:

import pyspark.sql.functions as F

# Keep only the latest order_date per id, then join back to df0
result = df0.join(
    df1.groupBy("id").agg(F.max("order_date").alias("order_date")),
    on=["id"]
)

result.show()
#+---+--------+----------+
#| id|quantity|order_date|
#+---+--------+----------+
#|  d|       1|2021-01-18|
#|  c|       6|2021-01-15|
#|  b|       7|2021-01-28|
#|  a|       4|2021-01-25|
#+---+--------+----------+
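
Note that this result does not include order_id, while the desired output does. A minimal sketch of one way to keep it (assuming the same df0 and df1 as in the question; if an id has ties on its max order_date, the orderBy would need an extra tie-breaker):

import pyspark.sql.functions as F
from pyspark.sql import Window

# Rank each id's orders by order_date, newest first
w = Window.partitionBy("id").orderBy(F.col("order_date").desc())

# Keep only the newest row per id, then join back to df0
latest = (df1
    .withColumn("rn", F.row_number().over(w))
    .filter(F.col("rn") == 1)
    .drop("rn"))

result = df0.join(latest, on=["id"])
result.show()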
A complete worked example in pyspark:

# Importing libraries
import findspark
findspark.init()  # must run before importing pyspark

import pyspark
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

# Creating the Spark session
spark = SparkSession.builder.appName("Test").getOrCreate()

# Data
data1 = [{'id': 'a', 'quantity': 4},
         {'id': 'b', 'quantity': 7},
         {'id': 'c', 'quantity': 6},
         {'id': 'd', 'quantity': 1}]

data2 = [{'id': 'a', 'order_id': 'x', 'order_date': '2021-01-25'},
         {'id': 'a', 'order_id': 'y', 'order_date': '2021-01-23'},
         {'id': 'b', 'order_id': 'z', 'order_date': '2021-01-28'},
         {'id': 'b', 'order_id': 'x', 'order_date': '2021-01-20'},
         {'id': 'c', 'order_id': 'y', 'order_date': '2021-01-15'},
         {'id': 'd', 'order_id': 'x', 'order_date': '2021-01-18'}]

# Creating the dataframes
df0 = spark.createDataFrame(data1)
df1 = spark.createDataFrame(data2)
df0.show()
+---+--------+
| id|quantity|
+---+--------+
|  a|       4|
|  b|       7|
|  c|       6|
|  d|       1|
+---+--------+
df1.show()
+---+----------+--------+
| id|order_date|order_id|
+---+----------+--------+
|  a|2021-01-25|       x|
|  a|2021-01-23|       y|
|  b|2021-01-28|       z|
|  b|2021-01-20|       x|
|  c|2021-01-15|       y|
|  d|2021-01-18|       x|
+---+----------+--------+

# Getting the latest order_date per id using an aggregate function

dff1 = df1.groupBy("id").agg(F.max("order_date").alias("order_date"))
dff1.show()
+---+----------+
| id|order_date|
+---+----------+
|  d|2021-01-18|
|  c|2021-01-15|
|  b|2021-01-28|
|  a|2021-01-25|
+---+----------+

# Applying the join and printing the result
result = df0.join(dff1, df0['id'] == dff1['id'], 'inner')
result.show()

+---+--------+---+----------+
| id|quantity| id|order_date|
+---+--------+---+----------+
|  d|       1|  d|2021-01-18|
|  c|       6|  c|2021-01-15|
|  b|       7|  b|2021-01-28|
|  a|       4|  a|2021-01-25|
+---+--------+---+----------+
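
This result still has a duplicate id column and no order_id. A minimal sketch of how to match the desired output exactly (assuming the dff1 from above; if an id has two orders on its max order_date, both rows are kept):

# Join dff1 back to df1 on (id, order_date) to recover order_id,
# and join on the column name so id appears only once
latest = dff1.join(df1, on=["id", "order_date"])
result = df0.join(latest, on=["id"])
result.show()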
