
PySpark - Join dataframe by time intervals

I have two dataframes. The first one is:

df_fruit= spark.createDataFrame([("Apple", "10:00"),("Orange", "12:35"),("Apple", "11:36"),("Apple","12:48"),("Pear","11:00")], ["Fruit", "Time"])

This dataframe stores the time at which each fruit should be eaten.

I also have a second dataframe that stores when a person actually ate the fruit, along with the weight and calories consumed.

df_calories= spark.createDataFrame([("Apple", "10:02", "86g", "1cal"),("Orange", "12:39", "75g", "14cal"),("Apple", "10:04", "9g", "47cal"),("Apple","12:46", "25g", "9cal"),("Orange","12:33", "75g", "2cal")], ["Fruit", "Time", "Weight", "Calories"])

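I need to join the two tables on Fruit, but also on a 5-minute time interval, because the fruit may be eaten at any time within 5 minutes (before or after) of the suggested time.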

This is the expected result:

+------+-----+-----+------+--------+
| Fruit| Time| Time|Weight|Calories|
+------+-----+-----+------+--------+
| Apple|10:00|10:02|   86g|    1cal|
| Apple|10:00|10:04|    9g|   47cal|
|Orange|12:35|12:39|   75g|   14cal|
|Orange|12:35|12:33|   75g|    2cal|
| Apple|11:36| null|  null|    null|
| Apple|12:48|12:46|   25g|    9cal|
|  Pear|11:00| null|  null|    null|
+------+-----+-----+------+--------+

The join type should be a left join, i.e. every row of df_fruit must be kept.
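To make the requested semantics concrete, here is a minimal plain-Python sketch of the same left join, without Spark. The helper names `to_minutes` and `left_join_within` are hypothetical and only serve as an illustration; the sketch assumes all times are zero-padded `HH:MM` strings on the same day.

```python
def to_minutes(hhmm):
    """Convert an 'HH:MM' string to minutes since midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def left_join_within(fruit_rows, calorie_rows, window=5):
    """Left join on fruit name, keeping calorie rows within +/- window minutes."""
    result = []
    for fruit, planned in fruit_rows:
        matches = [
            (fruit, planned, eaten, weight, calories)
            for other, eaten, weight, calories in calorie_rows
            if other == fruit
            and abs(to_minutes(eaten) - to_minutes(planned)) <= window
        ]
        # A left join keeps the planned row even when nothing matched.
        result.extend(matches or [(fruit, planned, None, None, None)])
    return result


fruit_rows = [("Apple", "10:00"), ("Orange", "12:35"), ("Apple", "11:36"),
              ("Apple", "12:48"), ("Pear", "11:00")]
calorie_rows = [("Apple", "10:02", "86g", "1cal"),
                ("Orange", "12:39", "75g", "14cal"),
                ("Apple", "10:04", "9g", "47cal"),
                ("Apple", "12:46", "25g", "9cal"),
                ("Orange", "12:33", "75g", "2cal")]

for row in left_join_within(fruit_rows, calorie_rows):
    print(row)
```

With the sample data above this produces seven rows: the two unmatched fruit rows (Apple 11:36 and Pear 11:00) are kept with `None` in the calorie columns.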

Assuming the time interval is fixed at 5 minutes, we can create a start time and an end time for each fruit in df_fruit and join on that range:

>>> import datetime
>>> from pyspark.sql import functions as F
>>> from pyspark.sql.types import StringType

>>> maxtimeudf = F.udf(lambda x : (datetime.datetime.strptime(x,'%H:%M')+datetime.timedelta(minutes=5)).strftime('%H:%M'),StringType())
>>> mintimeudf = F.udf(lambda x : (datetime.datetime.strptime(x,'%H:%M')+datetime.timedelta(minutes=-5)).strftime('%H:%M'),StringType())

>>> df_fruit = df_fruit.withColumn('starttime',mintimeudf(df_fruit['Time'])).withColumn('endtime',maxtimeudf(df_fruit['Time']))
>>> df_fruit.show()
+------+-----+---------+-------+
| Fruit| Time|starttime|endtime|
+------+-----+---------+-------+
| Apple|10:00|    09:55|  10:05|
|Orange|12:35|    12:30|  12:40|
| Apple|11:36|    11:31|  11:41|
| Apple|12:48|    12:43|  12:53|
|  Pear|11:00|    10:55|  11:05|
+------+-----+---------+-------+
>>> df = df_fruit.join(df_calories,((df_fruit.Fruit == df_calories.Fruit) & (df_calories.Time.between(df_fruit.starttime,df_fruit.endtime))),'left_outer')
>>> df.select(df_fruit['Fruit'],df_fruit['Time'],df_calories['Time'],df_calories['Weight'],df_calories['Calories']).show()
+------+-----+-----+------+--------+
| Fruit| Time| Time|Weight|Calories|
+------+-----+-----+------+--------+
|Orange|12:35|12:39|   75g|   14cal|
|Orange|12:35|12:33|   75g|    2cal|
|  Pear|11:00| null|  null|    null|
| Apple|10:00|10:02|   86g|    1cal|
| Apple|10:00|10:04|    9g|   47cal|
| Apple|11:36| null|  null|    null|
| Apple|12:48|12:46|   25g|    9cal|
+------+-----+-----+------+--------+    
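One thing worth noting: `starttime` and `endtime` are strings, so `between` compares them lexicographically (and inclusively). That works here because the values are zero-padded `HH:MM` on the same day, but it stops working when the window crosses midnight. The sketch below reproduces the UDF arithmetic in plain Python to show this; `window_bounds` and `in_window` are hypothetical helper names, not part of PySpark.

```python
import datetime


def window_bounds(hhmm, minutes=5):
    """Return (start, end) 'HH:MM' strings, computed like the two UDFs above."""
    t = datetime.datetime.strptime(hhmm, "%H:%M")
    delta = datetime.timedelta(minutes=minutes)
    return (t - delta).strftime("%H:%M"), (t + delta).strftime("%H:%M")


def in_window(eaten, planned, minutes=5):
    """Inclusive string comparison, mirroring between() on string columns."""
    start, end = window_bounds(planned, minutes)
    return start <= eaten <= end


print(window_bounds("10:00"))        # ('09:55', '10:05')
print(in_window("12:33", "12:35"))   # True
print(window_bounds("23:58"))        # ('23:53', '00:03')
print(in_window("23:59", "23:58"))   # False: the end bound wrapped past midnight
```

If times near midnight can occur in the data, one option is to compare full timestamps (date plus time) instead of bare `HH:MM` strings.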

Hope this helps!

