Merge two dataframes based on closest datetime
I have two data sets, one containing air quality data and one containing weather data, each with a column named 'dt' for date and time. However, these times do not match exactly. I would like to join these tables so that the air quality data is retained and each row is matched and merged with the weather row closest in time.
df_aq:
dt Latitude Longitude ... Speed_kmh PM2.5 PM10
0 11/20/2018 12:16 33.213922 -97.151055 ... 0.35 16.0 86.1
1 11/20/2018 12:16 33.213928 -97.151007 ... 5.01 16.0 86.1
2 11/20/2018 12:16 33.213907 -97.150953 ... 5.27 16.0 86.1
3 11/20/2018 12:16 33.213872 -97.150883 ... 5.03 16.0 86.1
...
364 11/20/2018 12:46 33.209462 -97.148623 ... 0.00 2.8 6.3
365 11/20/2018 12:46 33.209462 -97.148623 ... 0.00 2.8 6.3
366 11/20/2018 12:46 33.209462 -97.148623 ... 0.00 2.8 6.3
df_weather:
USAF WBAN dt DIR SPD ... PCP01 PCP06 PCP24 PCPXX
0 722589 3991 11/20/2018 0:53 360 6 ... 0 ***** ***** *****
1 722589 3991 11/20/2018 1:53 350 6 ... 0 ***** ***** *****
2 722589 3991 11/20/2018 2:53 310 3 ... 0 ***** ***** *****
3 722589 3991 11/20/2018 3:53 330 5 ... 0 ***** ***** *****
4 722589 3991 11/20/2018 4:53 310 6 ... 0 ***** ***** *****
df_aq ranges from 12:16 to 12:46, and df_weather has data every hour at the 53-minute mark. Therefore the closest times would be 11:53 and 12:53, so I would like those two times and the corresponding weather data to merge appropriately with all the data in df_aq.
I've tried experimenting with iloc and Index.get_loc as that seems to be the best way, but I keep getting an error.
I've tried:
ctr = df_aq['dt'].count() - 1
startTime = df_aq['dt'][0]
endTime = df_aq['dt'][ctr]
print df_weather.iloc[df_weather.index.get_loc(startTime,method='nearest') or df_weather.index.get_loc(endTime,method='nearest')]
but then I get an error:
TypeError: unsupported operand type(s) for -: 'long' and 'str'
I'm not sure what this error means.
Is there a better way to do this than iloc? And if not, what am I doing wrong with this bit of code?
Thank you very much for any help you can offer.
I'm taking the liberty of sharing an example I used while learning :-), hope it helps you achieve what you are looking for.
As stated in the comment section, you can try the special function merge_asof() for merging time-series DataFrames.
First DataFrame:
>>> df1
time ticker price quantity
0 2016-05-25 13:30:00.023 MSFT 51.95 75
1 2016-05-25 13:30:00.038 MSFT 51.95 155
2 2016-05-25 13:30:00.048 GOOG 720.77 100
3 2016-05-25 13:30:00.048 GOOG 720.92 100
4 2016-05-25 13:30:00.048 AAPL 98.00 100
Second DataFrame:
>>> df2
time ticker bid ask
0 2016-05-25 13:30:00.023 GOOG 720.50 720.93
1 2016-05-25 13:30:00.023 MSFT 51.95 51.96
2 2016-05-25 13:30:00.030 MSFT 51.97 51.98
3 2016-05-25 13:30:00.041 MSFT 51.99 52.00
4 2016-05-25 13:30:00.048 GOOG 720.50 720.93
5 2016-05-25 13:30:00.049 AAPL 97.99 98.01
6 2016-05-25 13:30:00.072 GOOG 720.50 720.88
7 2016-05-25 13:30:00.075 MSFT 52.01 52.03
>>> new_df = pd.merge_asof(df1, df2, on='time', by='ticker')
>>> new_df
time ticker price quantity bid ask
0 2016-05-25 13:30:00.023 MSFT 51.95 75 51.95 51.96
1 2016-05-25 13:30:00.038 MSFT 51.95 155 51.97 51.98
2 2016-05-25 13:30:00.048 GOOG 720.77 100 720.50 720.93
3 2016-05-25 13:30:00.048 GOOG 720.92 100 720.50 720.93
4 2016-05-25 13:30:00.048 AAPL 98.00 100 NaN NaN
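Note that the call above uses the default direction='backward', which only looks at earlier (or equal) timestamps; that is why the AAPL row gets NaN. For the air quality and weather frames in the question, a closest-time match would need direction='nearest'. Here is a minimal sketch under some assumptions: the frames are tiny stand-ins for the real ones, the SPD values and the 10:53 to 12:53 weather rows are made up for illustration (the post only shows 0:53 to 4:53), and weather_dt is a helper column name I introduced.

```python
import pandas as pd

# Tiny stand-ins for the question's frames. The weather rows and SPD values
# are assumed for illustration; the post does not show the 11:53/12:53 rows.
df_aq = pd.DataFrame({
    "dt": ["11/20/2018 12:16", "11/20/2018 12:16", "11/20/2018 12:46"],
    "PM2.5": [16.0, 16.0, 2.8],
})
df_weather = pd.DataFrame({
    "dt": ["11/20/2018 10:53", "11/20/2018 11:53", "11/20/2018 12:53"],
    "SPD": [5, 6, 3],
})

# merge_asof needs real datetimes (not strings) and both frames sorted on the key.
for df in (df_aq, df_weather):
    df["dt"] = pd.to_datetime(df["dt"], format="%m/%d/%Y %H:%M")
df_aq = df_aq.sort_values("dt")
df_weather = df_weather.sort_values("dt")

# Copy the weather timestamp into a separate (hypothetical) column so the
# matched weather time is still visible after the merge.
df_weather = df_weather.assign(weather_dt=df_weather["dt"])

# Left frame is df_aq, so every air quality row is kept; each one picks up
# the weather row whose 'dt' is closest in either direction.
merged = pd.merge_asof(df_aq, df_weather, on="dt", direction="nearest")
print(merged)
```

If a match that is too far away should be left empty instead, merge_asof also takes a tolerance argument, e.g. tolerance=pd.Timedelta("1h").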
Check the documentation for merge_asof.
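As for the TypeError in the question: most likely df_weather still has its default integer index and startTime is a string, so the "nearest" lookup ends up subtracting a string from an integer. Also, `a or b` on two positions just returns one of them, not both rows. If you would rather stay with an index lookup, something like the sketch below should work, assuming the weather times are parsed and set as a sorted DatetimeIndex. It uses Index.get_indexer, since recent pandas versions no longer accept method= in get_loc. The frame contents are made up for illustration.

```python
import pandas as pd

# Hypothetical weather frame indexed by parsed timestamps (values are made up).
weather = pd.DataFrame(
    {"SPD": [5, 6, 3]},
    index=pd.to_datetime(
        ["2018-11-20 10:53", "2018-11-20 11:53", "2018-11-20 12:53"]
    ),
)

# Start and end of the air quality window, as Timestamps rather than strings.
start = pd.Timestamp("2018-11-20 12:16")
end = pd.Timestamp("2018-11-20 12:46")

# get_indexer returns one integer position per requested timestamp.
positions = weather.index.get_indexer([start, end], method="nearest")
nearest_rows = weather.iloc[positions]
print(nearest_rows)
```

This only finds the two bracketing weather rows; it does not attach them to each air quality row, which is what merge_asof does for you.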