pyspark - read df by row to search in another df
I'm new to pyspark and I need help searching within a df.
I have df1 with student data, as follows:
+---------+----------+--------------------+
|studentid| course | registration_date |
+---------+----------+--------------------+
| 348| 2| 15-11-2021 |
| 567| 1| 05-11-2021 |
| 595| 3| 15-10-2021 |
| 580| 2| 06-11-2021 |
| 448| 4| 15-09-2021 |
+---------+----------+--------------------+
And df2, with information about the registration periods, as follows:
+--------+------------+------------+
| period | start_date | end_date |
+--------+------------+------------+
| 1| 01-09-2021 | 15-09-2021 |
| 2| 16-09-2021 | 30-09-2021 |
| 3| 01-10-2021 | 15-10-2021 |
| 4| 16-10-2021 | 31-10-2021 |
| 5| 01-11-2021 | 15-11-2021 |
| 6| 16-11-2021 | 30-11-2021 |
+--------+------------+------------+
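I need to iterate over df1 row by row, take each student's registration date, and use it to look up in df2 the period that satisfies the condition df2.start_date <= df1.registration_date <= df2.end_date.
The result would be a new df, as follows: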
+---------+----------+--------------------+--------+------------+------------+
|studentid| course | registration_date | period | start_date | end_date |
+---------+----------+--------------------+--------+------------+------------+
| 348| 2| 15-11-2021 | 5| 01-11-2021 | 15-11-2021 |
| 567| 1| 05-11-2021 | 5| 01-11-2021 | 15-11-2021 |
| 595| 3| 15-10-2021 | 3| 01-10-2021 | 15-10-2021 |
| 580| 2| 06-11-2021 | 5| 01-11-2021 | 15-11-2021 |
| 448| 4| 15-09-2021 | 1| 01-09-2021 | 15-09-2021 |
+---------+----------+--------------------+--------+------------+------------+
You don't need to iterate row by row: you can specify the join condition as a complex expression.
from datetime import datetime

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Reuses the active session if one exists (e.g. in the pyspark shell)
spark = SparkSession.builder.getOrCreate()

# Student registrations (df1 in the question)
df = spark.createDataFrame([
    (348, 2, datetime.strptime("15-11-2021", "%d-%m-%Y")),
    (567, 1, datetime.strptime("05-11-2021", "%d-%m-%Y")),
    (595, 3, datetime.strptime("15-10-2021", "%d-%m-%Y")),
    (580, 2, datetime.strptime("06-11-2021", "%d-%m-%Y")),
    (448, 4, datetime.strptime("15-09-2021", "%d-%m-%Y")),
], ("studentid", "course", "registration_date")).withColumn("registration_date", F.to_date(F.col("registration_date")))

# Registration periods
df2 = spark.createDataFrame([
    (1, datetime.strptime("01-09-2021", "%d-%m-%Y"), datetime.strptime("15-09-2021", "%d-%m-%Y")),
    (2, datetime.strptime("16-09-2021", "%d-%m-%Y"), datetime.strptime("30-09-2021", "%d-%m-%Y")),
    (3, datetime.strptime("01-10-2021", "%d-%m-%Y"), datetime.strptime("15-10-2021", "%d-%m-%Y")),
    (4, datetime.strptime("16-10-2021", "%d-%m-%Y"), datetime.strptime("31-10-2021", "%d-%m-%Y")),
    (5, datetime.strptime("01-11-2021", "%d-%m-%Y"), datetime.strptime("15-11-2021", "%d-%m-%Y")),
    (6, datetime.strptime("16-11-2021", "%d-%m-%Y"), datetime.strptime("30-11-2021", "%d-%m-%Y")),
], ("period", "start_date", "end_date")).withColumn("start_date", F.to_date(F.col("start_date"))).withColumn("end_date", F.to_date(F.col("end_date")))

# Inner join: keep each student together with the period whose range contains the registration date
df.join(df2, (df2["start_date"] <= df["registration_date"]) & (df["registration_date"] <= df2["end_date"])).show()
+---------+------+-----------------+------+----------+----------+
|studentid|course|registration_date|period|start_date| end_date|
+---------+------+-----------------+------+----------+----------+
| 348| 2| 2021-11-15| 5|2021-11-01|2021-11-15|
| 567| 1| 2021-11-05| 5|2021-11-01|2021-11-15|
| 595| 3| 2021-10-15| 3|2021-10-01|2021-10-15|
| 448| 4| 2021-09-15| 1|2021-09-01|2021-09-15|
| 580| 2| 2021-11-06| 5|2021-11-01|2021-11-15|
+---------+------+-----------------+------+----------+----------+