Fill a dataframe column with the latest value from another column

Question

I have two dataframes, list1 and list2, that have different numbers of rows and random indices. list1 has ~240,000 rows while list2 has ~390,000 rows. Both are sorted from the earliest time to the latest according to the ['time'] column. They look roughly like this:

list1

     time    rates
299  09:31   1.30
1230 10:34   2.42
32   13:40   1.49
     ...   ...

list2

     time    Symbol    IV
78   10:31   aqb       7
121  10:59   cdd       3
3240 11:19   oty       4
393  13:54   zqb       8
44   14:13   omu       1
     ...

Each row in list2 has a ['time'] value. For each row in list2, I want the ['rates'] value from the latest row in list1 whose ['time'] is no later than that row's own ['time']. The same ['rates'] value keeps being filled into list2 until a newer list1 row applies (sorry, I know this is confusing). An example of the desired result, with an explanation, is shown below.

Desired result

     time    Symbol    IV    rates
78   10:31   aqb       7     1.30
121  10:59   cdd       3     2.42
3240 11:19   oty       4     2.42
393  13:54   zqb       8     1.49
44   14:13   omu       1     1.49

The first row in list1 is from 9:31, and the second row is from 10:34. The first row in list2 is at 10:31, so it should be filled with the ['rates'] value from 9:31 instead of the ['rates'] value from 10:34, since 10:34 is later than 10:31. The next row in list2 is at 10:59. The latest row in list1 that is not after 10:59 is the one at 10:34, so the value 2.42 from 10:34 is filled in. The same goes for the third row in list2, at 11:19.
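To pin the rule down, here is a minimal sketch of the lookup for the sample rows only, using the standard-library bisect module. It is meant to illustrate the rule, not to be the fast solution asked for; the names list1_times, list1_rates, list2_times and latest_rate are made up for this illustration, and it assumes the times are zero-padded strings so that string order matches chronological order.

```python
from bisect import bisect_right

# Sample values copied from the list1 excerpt above (zero-padded times,
# so comparing the strings is the same as comparing the clock times).
list1_times = ["09:31", "10:34", "13:40"]
list1_rates = [1.30, 2.42, 1.49]

def latest_rate(t):
    """Return the rate of the last list1 row whose time is <= t, or None."""
    pos = bisect_right(list1_times, t)  # how many list1 times are <= t
    return list1_rates[pos - 1] if pos > 0 else None

list2_times = ["10:31", "10:59", "11:19", "13:54", "14:13"]
looked_up = [latest_rate(t) for t in list2_times]
print(looked_up)  # [1.3, 2.42, 2.42, 1.49, 1.49]
```

A list2 time earlier than every list1 time has no earlier rate to take, which is why the helper returns None in that case.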

How do I accomplish this without using a for loop that slowly iterrows() through every single row and runs a bunch of the if/else checks described above? That would take an eternity given the few hundred thousand rows in each dataframe. Thanks!

Answer 1

Using merge_asof

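import pandas as pd
# df1 and df2 stand for list1 and list2; parse the time strings into datetimes
df1['time'] = pd.to_datetime(df1['time'], format='%H:%M')
df2['time'] = pd.to_datetime(df2['time'], format='%H:%M')
# for each df2 row, take the last df1 row whose time is <= the df2 time
pd.merge_asof(df2.sort_values('time'), df1.sort_values('time'), on='time', direction='backward')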
Out[79]: 
                 time Symbol  IV  rates
0 1900-01-01 10:31:00    aqb   7   1.30
1 1900-01-01 10:59:00    cdd   3   2.42
2 1900-01-01 11:19:00    oty   4   2.42
3 1900-01-01 13:54:00    zqb   8   1.49
4 1900-01-01 14:13:00    omu   1   1.49
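The output above has a fresh 0..4 index and datetime values in the time column. If the original list2 index labels and time strings matter, one possible variation is sketched below. It rebuilds the sample frames from the question; the helper column name key and the variable names left, right, merged and result are assumptions made for this sketch, not part of the original answer.

```python
import pandas as pd

# Sample frames rebuilt from the excerpts in the question.
list1 = pd.DataFrame(
    {"time": ["09:31", "10:34", "13:40"], "rates": [1.30, 2.42, 1.49]},
    index=[299, 1230, 32],
)
list2 = pd.DataFrame(
    {
        "time": ["10:31", "10:59", "11:19", "13:54", "14:13"],
        "Symbol": ["aqb", "cdd", "oty", "zqb", "omu"],
        "IV": [7, 3, 4, 8, 1],
    },
    index=[78, 121, 3240, 393, 44],
)

# Keep list2's index labels in a column named "index" and put the parsed
# times in a separate helper column, leaving the "time" strings untouched.
left = list2.reset_index()
left["key"] = pd.to_datetime(left["time"], format="%H:%M")
right = pd.DataFrame(
    {"key": pd.to_datetime(list1["time"], format="%H:%M"), "rates": list1["rates"]}
)

# Backward as-of join: each left row gets the last right row with key <= its key.
merged = pd.merge_asof(
    left.sort_values("key"), right.sort_values("key"), on="key", direction="backward"
)

# Drop the helper column and restore the original index labels.
result = merged.drop(columns="key").set_index("index")
result.index.name = None
print(result)
```

Rows of list2 that are earlier than every list1 time would end up with NaN in rates, since there is no earlier row to match.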

Answer 2

I simply merged the two dataframes on ['time'] with an indicator, then sorted the new dataframe on ['time']:

list2 = list2.merge(list1,how = 'outer', on= ['time'], indicator = True)
list2 = list2.sort_values(['time'])

and then filled the rows marked 'left_only', which have NaN ['rates'] values after the merge, with the latest value from the preceding 'right_only' rows by using:

list2 = list2.ffill()

Then dropped the rows that came from list1 with:

list2= list2.loc[list2['_merge']!= 'right_only']
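Put together on the sample data from the question, these steps might look like the sketch below. It makes a few assumptions beyond the answer: the times are zero-padded strings (so sorting the strings is chronological), only the rates column is forward-filled so that Symbol and IV are never copied between rows, and IV is cast back to integers because the outer merge introduces NaN into that column. The names merged and result are made up for this sketch.

```python
import pandas as pd

# Sample frames rebuilt from the excerpts in the question.
list1 = pd.DataFrame(
    {"time": ["09:31", "10:34", "13:40"], "rates": [1.30, 2.42, 1.49]},
    index=[299, 1230, 32],
)
list2 = pd.DataFrame(
    {
        "time": ["10:31", "10:59", "11:19", "13:54", "14:13"],
        "Symbol": ["aqb", "cdd", "oty", "zqb", "omu"],
        "IV": [7, 3, 4, 8, 1],
    },
    index=[78, 121, 3240, 393, 44],
)

# Outer merge keeps every time from both frames; _merge records the origin.
merged = list2.merge(list1, how="outer", on="time", indicator=True)
merged = merged.sort_values("time", kind="stable")

# Carry the most recent list1 rate forward onto the list2-only rows.
merged["rates"] = merged["rates"].ffill()

# Drop the rows that exist only in list1 and tidy up the helper column.
result = (
    merged.loc[merged["_merge"] != "right_only"]
    .drop(columns="_merge")
    .astype({"IV": "int64"})
    .reset_index(drop=True)
)
print(result)
```

Like the merge in the answer, this does not keep the original list2 index labels, because merging on a column produces a new index.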


Answer 1: score 2, accepted, 2018-08-14 17:45:06

Answer 2: score 0, 2018-08-14 17:39:18
