How to compare dates from two dataframes and update the value in the column
I have two dataframes that reference weather stations:
import pandas as pd
df_shift = pd.DataFrame({'Date': ['2010-10-05', '2010-10-20', '2011-03-15',
                                  '2012-03-22', '2015-01-17', '2015-01-23',
                                  '2015-01-30'],
                         'Sensor_id': [1024, 1024, 1024, 1024,
                                       2210, 2210, 1010]})
df_station = pd.DataFrame({'Sensor_id': [1024, 1024, 1024, 2210, 2210],
                           'Sensor_type': ['analog', 'analog', 'analog', 'dig', 'dig'],
                           'Date': ['2010-10-01', '2010-10-22', '2011-03-14',
                                    '2015-01-13', '2015-01-22']})
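I want to create a new column in df_station, named "new_column".
For each station row, this column should hold the smallest difference in days between its Date and the Date values in df_shift for the same Sensor_id.
I wrote the following code: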
# Starting with a very large value
df_station['new_column'] = 90000
for i in range(0, len(df_station)):
    for j in range(0, len(df_shift)):
        # The dates are written as YYYY-MM-DD, so the format uses dashes
        var_Difference_Date = abs(pd.to_datetime(df_station['Date'].iloc[i],
                                                 format='%Y-%m-%d') -
                                  pd.to_datetime(df_shift['Date'].iloc[j], format='%Y-%m-%d'))
        if(df_station['Sensor_id'].iloc[i] == df_shift['Sensor_id'].iloc[j]):
            if(var_Difference_Date.days < df_station['new_column'].iloc[i]):
                # Single .loc call instead of chained assignment
                df_station.loc[i, 'new_column'] = var_Difference_Date.days
The result is as expected:
Sensor_id  Sensor_type  Date        new_column
1024       analog       2010-10-01  4
1024       analog       2010-10-22  2
1024       analog       2011-03-14  1
2210       dig          2015-01-13  4
2210       dig          2015-01-22  1
However, is there a more efficient way to do this without using two for loops? Thank you.
We can do this with merge_asof, using by and on:
df_station['Date'] = pd.to_datetime(df_station['Date'])
df_shift['Date'] = pd.to_datetime(df_shift['Date'])
df_shift['DIFF'] = df_shift['Date']
df = pd.merge_asof(df_station, df_shift[['Date', 'Sensor_id', 'DIFF']],
                   on='Date',
                   by='Sensor_id',
                   direction='nearest')
df['DIFF'] = (df.Date - df.DIFF).dt.days.abs()
df
Out[377]:
   Sensor_id  Sensor_type        Date  DIFF
0       1024       analog  2010-10-01     4
1       1024       analog  2010-10-22     2
2       1024       analog  2011-03-14     1
3       2210          dig  2015-01-13     4
4       2210          dig  2015-01-22     1
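As a minimal sketch of the same merge_asof idea, the helper below sorts both frames by date first (merge_asof requires the `on` column to be sorted) and then restores the original row order. The function name nearest_shift_gap and the temporary columns _row and _shift_date are made up for this illustration; they are not part of pandas.

```python
import pandas as pd


def nearest_shift_gap(station, shift, key="Sensor_id", date_col="Date"):
    """Per station row, day gap to the nearest shift date of the same sensor."""
    left = station[[key, date_col]].copy()
    left[date_col] = pd.to_datetime(left[date_col])
    left["_row"] = range(len(left))  # remember the original row order

    right = shift[[key, date_col]].copy()
    right[date_col] = pd.to_datetime(right[date_col])
    right["_shift_date"] = right[date_col]  # keep the matched date after the merge

    # merge_asof requires both frames to be sorted by the `on` column.
    merged = pd.merge_asof(
        left.sort_values(date_col),
        right.sort_values(date_col),
        on=date_col,
        by=key,
        direction="nearest",
    )
    gap = (merged[date_col] - merged["_shift_date"]).abs().dt.days
    # Put the gaps back in the original row order.
    gap.index = merged["_row"].to_numpy()
    return pd.Series(gap.sort_index().to_numpy(), index=station.index)


df_shift = pd.DataFrame({
    "Date": ["2010-10-05", "2010-10-20", "2011-03-15", "2012-03-22",
             "2015-01-17", "2015-01-23", "2015-01-30"],
    "Sensor_id": [1024, 1024, 1024, 1024, 2210, 2210, 1010],
})
df_station = pd.DataFrame({
    "Sensor_id": [1024, 1024, 1024, 2210, 2210],
    "Sensor_type": ["analog", "analog", "analog", "dig", "dig"],
    "Date": ["2010-10-01", "2010-10-22", "2011-03-14", "2015-01-13", "2015-01-22"],
})
df_station["new_column"] = nearest_shift_gap(df_station, df_shift)
```

A station row whose sensor has no rows in df_shift should come back as NaN rather than raising, which is one reason to prefer this over the sentinel value 90000.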
# Converting both dates in pandas datetime format
df_shift['Date'] = pd.to_datetime(df_shift['Date'])
df_station['Date'] = pd.to_datetime(df_station['Date'])
# Aggregating for each Sensor_id, all the dates in a list
a = df_shift.groupby(['Sensor_id'])['Date'].apply(list).reset_index(name='dates_list')
# Merging it with the df_station
df_station = df_station.merge(a, on='Sensor_id', how='left')
# Finding the smallest number of days
def get_diff(x):
    d1, l = x
    for i, d2 in enumerate(l):
        if i == 0:
            diff = abs((d2 - d1).days)
        else:
            t = abs((d2 - d1).days)
            if t < diff:
                diff = t
    return diff
df_station['new_column'] = df_station[['Date', 'dates_list']].apply(get_diff, axis=1)
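The loop inside get_diff is a running minimum, so it can be written with the built-in min. The sketch below is one way to do that under the same data; the helper name min_day_gap is hypothetical, and the NaN guard is an added assumption about how to treat a sensor that has no rows in df_shift (after the left merge its dates_list is NaN, not a list).

```python
import pandas as pd

df_shift = pd.DataFrame({
    "Date": ["2010-10-05", "2010-10-20", "2011-03-15", "2012-03-22",
             "2015-01-17", "2015-01-23", "2015-01-30"],
    "Sensor_id": [1024, 1024, 1024, 1024, 2210, 2210, 1010],
})
df_station = pd.DataFrame({
    "Sensor_id": [1024, 1024, 1024, 2210, 2210],
    "Sensor_type": ["analog", "analog", "analog", "dig", "dig"],
    "Date": ["2010-10-01", "2010-10-22", "2011-03-14", "2015-01-13", "2015-01-22"],
})
df_shift["Date"] = pd.to_datetime(df_shift["Date"])
df_station["Date"] = pd.to_datetime(df_station["Date"])


def min_day_gap(date, candidates):
    """Smallest absolute day gap between `date` and any date in `candidates`."""
    # Hypothetical guard: a sensor without shift rows has NaN here, not a list.
    if not isinstance(candidates, list) or not candidates:
        return float("nan")
    return min(abs((d - date).days) for d in candidates)


# One list of shift dates per sensor, attached to every station row.
dates = (df_shift.groupby("Sensor_id")["Date"]
         .apply(list)
         .rename("dates_list")
         .reset_index())
merged = df_station.merge(dates, on="Sensor_id", how="left")
merged["new_column"] = [
    min_day_gap(d, c) for d, c in zip(merged["Date"], merged["dates_list"])
]
```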
df_shift['Date_s'] = pd.to_datetime(df_shift['Date'])
df_station['Date'] = pd.to_datetime(df_station['Date'])
t = pd.merge_asof(df_station, df_shift[['Date_s', 'Sensor_id']],
                  left_on='Date',
                  right_on='Date_s',
                  direction='nearest')
# No by='Sensor_id' here: the nearest date may belong to another sensor,
# and such rows are dropped by this filter
t = t[t['Sensor_id_x'] == t['Sensor_id_y']].copy()
t['new column'] = abs((t['Date_s'] - t['Date']).dt.days)
t.drop(columns=['Date_s', 'Sensor_id_x'], inplace=True)
t.columns = ['Sensor_type', 'Date', 'Sensor_id', 'new column']
Output:
   Sensor_type        Date  Sensor_id  new column
0       analog  2010-10-01       1024           4
1       analog  2010-10-22       1024           2
2       analog  2011-03-14       1024           1
3          dig  2015-01-13       2210           4
4          dig  2015-01-22       2210           1
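One thing to watch with merging on the date alone and filtering by sensor afterwards: if the nearest shift date belongs to a different sensor, the station row is dropped instead of being matched to its own sensor's nearest date. The tiny example below uses made-up data (not the question's) to show the difference between that approach and passing by='Sensor_id'.

```python
import pandas as pd

# Hypothetical data: the closest shift date (2015-01-17) belongs to sensor 2210,
# while the station row belongs to sensor 1024.
station = pd.DataFrame({
    "Sensor_id": [1024],
    "Date": pd.to_datetime(["2015-01-18"]),
})
shift = pd.DataFrame({
    "Sensor_id": [1024, 2210],
    "Date_s": pd.to_datetime(["2015-01-10", "2015-01-17"]),
})

# Nearest date regardless of sensor, then filter on matching sensors.
no_by = pd.merge_asof(station, shift, left_on="Date", right_on="Date_s",
                      direction="nearest")
kept = no_by[no_by["Sensor_id_x"] == no_by["Sensor_id_y"]]

# Nearest date within the same sensor.
with_by = pd.merge_asof(station, shift, left_on="Date", right_on="Date_s",
                        by="Sensor_id", direction="nearest")
gap_days = (with_by["Date"] - with_by["Date_s"]).abs().dt.days

print(len(kept))          # rows surviving the filter
print(gap_days.tolist())  # gaps found with by="Sensor_id"
```

With the question's data both approaches agree, because every station row's nearest shift date happens to belong to the same sensor.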
Build the input dataframes:
import pandas as pd
df_shift = pd.DataFrame({'Date': ['2010-10-05', '2010-10-20', '2011-03-15', '2012-03-22',
                                  '2015-01-17', '2015-01-23', '2015-01-30'],
                         'Sensor_id': [1024, 1024, 1024, 1024, 2210, 2210, 1010]})
df_station = pd.DataFrame({'Sensor_id': [1024, 1024, 1024, 2210, 2210],
                           'Sensor_type': ['analog', 'analog', 'analog', 'dig', 'dig'],
                           'Date': ['2010-10-01', '2010-10-22', '2011-03-14',
                                    '2015-01-13', '2015-01-22']})
# Keep the dates as datetime64 so that subtracting them yields timedelta64 values
df_shift["Date"] = pd.to_datetime(df_shift["Date"])
df_station["Date"] = pd.to_datetime(df_station["Date"])
Merge the two dataframes and compute the absolute date difference:
df_merge = pd.merge(df_station, df_shift, how="left", on="Sensor_id", suffixes=["_station","_shift"])
df_merge['Date_abs_diff'] = (df_merge.Date_shift - df_merge.Date_station).abs()
The merged dataframe is now:
>>> df_merge
    Date_station  Sensor_id  Sensor_type  Date_shift  Date_abs_diff
 0    2010-10-01       1024       analog  2010-10-05         4 days
 1    2010-10-01       1024       analog  2010-10-20        19 days
 2    2010-10-01       1024       analog  2011-03-15       165 days
 3    2010-10-01       1024       analog  2012-03-22       538 days
 4    2010-10-22       1024       analog  2010-10-05        17 days
 5    2010-10-22       1024       analog  2010-10-20         2 days
 6    2010-10-22       1024       analog  2011-03-15       144 days
 7    2010-10-22       1024       analog  2012-03-22       517 days
 8    2011-03-14       1024       analog  2010-10-05       160 days
 9    2011-03-14       1024       analog  2010-10-20       145 days
10    2011-03-14       1024       analog  2011-03-15         1 days
11    2011-03-14       1024       analog  2012-03-22       374 days
12    2015-01-13       2210          dig  2015-01-17         4 days
13    2015-01-13       2210          dig  2015-01-23        10 days
14    2015-01-22       2210          dig  2015-01-17         5 days
15    2015-01-22       2210          dig  2015-01-23         1 days
Next, do a groupby and take the minimum date difference for each station date:
df_min = df_merge.groupby(by="Date_station")["Date_abs_diff"].agg("min").reset_index()
>>> df_min
   Date_station  Date_abs_diff
0    2010-10-01         4 days
1    2010-10-22         2 days
2    2011-03-14         1 days
3    2015-01-13         4 days
4    2015-01-22         1 days
Finally, merge it back into df_station and clean up to get the final result:
df_output = pd.merge(df_station, df_min, how="left", left_on="Date", right_on="Date_station")
df_output.drop(columns='Date_station', inplace=True)
df_output.rename(columns={'Date_abs_diff': 'new_column'}, inplace=True)
df_output['new_column'] = df_output['new_column'].dt.days
>>> df_output
   Sensor_id  Sensor_type        Date  new_column
0       1024       analog  2010-10-01           4
1       1024       analog  2010-10-22           2
2       1024       analog  2011-03-14           1
3       2210          dig  2015-01-13           4
4       2210          dig  2015-01-22           1
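The steps above group by the station date only, so two sensors that share a station date would be pooled into one minimum. A possible variant, sketched below, tags each station row with its own label before merging and takes the minimum per row, which also removes the need for the second merge. The names row, pairs and gap are made up for this sketch.

```python
import pandas as pd

df_shift = pd.DataFrame({
    "Date": ["2010-10-05", "2010-10-20", "2011-03-15", "2012-03-22",
             "2015-01-17", "2015-01-23", "2015-01-30"],
    "Sensor_id": [1024, 1024, 1024, 1024, 2210, 2210, 1010],
})
df_station = pd.DataFrame({
    "Sensor_id": [1024, 1024, 1024, 2210, 2210],
    "Sensor_type": ["analog", "analog", "analog", "dig", "dig"],
    "Date": ["2010-10-01", "2010-10-22", "2011-03-14", "2015-01-13", "2015-01-22"],
})
df_shift["Date"] = pd.to_datetime(df_shift["Date"])
df_station["Date"] = pd.to_datetime(df_station["Date"])

# Carry the station row label through the merge as a column named "row".
pairs = df_station.rename_axis("row").reset_index().merge(
    df_shift, on="Sensor_id", how="left", suffixes=("_station", "_shift")
)
pairs["gap"] = (pairs["Date_shift"] - pairs["Date_station"]).abs().dt.days

# One minimum per original station row; the assignment aligns on the index.
df_station["new_column"] = pairs.groupby("row")["gap"].min()
```

Like the original steps, this builds every station/shift pair per sensor, so memory grows with the product of the row counts within each sensor.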