Filling dataframe column with the latest values of another column

I have two dataframes, list1 and list2, each with a different number of rows and non-sequential indices. list1 has ~240,000 rows and list2 has ~390,000 rows. Both are sorted from earliest to latest by the ['time'] column. They look roughly like this:

list1

     time    rates
299  09:31   1.30
1230 10:34   2.42
32   13:40   1.49
     ...   ...

list2

     time    Symbol    IV
78   10:31   aqb       7
121  10:59   cdd       3
3240 11:19   oty       4
393  13:54   zqb       8
44   14:13   omu       1
     ... 

Each row in list2 has a ['time'] value. I want each row in list2 to get the latest ['rates'] value from list1 whose time is no later than the row's own ['time'] value. The same ['rates'] value keeps being filled into list2 until a newer list1 row applies (sorry, I know this is confusing). An example of the desired result, with an explanation, is shown below.

Desired result

     time    Symbol    IV    rates
78   10:31   aqb       7     1.30
121  10:59   cdd       3     2.42
3240 11:19   oty       4     2.42
393  13:54   zqb       8     1.49
44   14:13   omu       1     1.49

The first row in list1 is from 09:31, and the second row is from 10:34. The first row in list2 is at 10:31, so it should be filled with the ['rates'] value from 09:31 rather than the one from 10:34, since 10:34 is later than 10:31. The next row in list2 is at 10:59. The latest row in list1 that is not after 10:59 is 10:34, so the value 2.42 from 10:34 is filled in. The same applies to the third row in list2, at 11:19.

How do I accomplish this without a for loop that slowly iterrows() through every single row and runs a bunch of the if/else checks above, which would take an eternity given the few hundred thousand rows in each dataframe? Thanks!

Using merge_asof (df1 and df2 here stand for list1 and list2):

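import pandas as pd
# Parse the HH:MM strings so the merge compares times rather than text
df1['time'] = pd.to_datetime(df1['time'], format='%H:%M')
df2['time'] = pd.to_datetime(df2['time'], format='%H:%M')
# For each df2 row, take the last df1 row whose time is <= the df2 time
pd.merge_asof(df2.sort_values('time'), df1.sort_values('time'), on='time', direction='backward')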
Out[79]: 
                 time Symbol  IV  rates
0 1900-01-01 10:31:00    aqb   7   1.30
1 1900-01-01 10:59:00    cdd   3   2.42
2 1900-01-01 11:19:00    oty   4   2.42
3 1900-01-01 13:54:00    zqb   8   1.49
4 1900-01-01 14:13:00    omu   1   1.49
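
As the output above shows, merge_asof resets the index to 0..4 and the parsed times pick up a placeholder 1900-01-01 date. If the original index labels and the HH:MM strings matter, one possible workaround is sketched below. This is only a sketch under assumptions: the helper name attach_latest_rates and the temporary column _t are made up for illustration, the sample frames are retyped from the question, and the left frame is assumed to have an unnamed index and no existing column called "index".

```python
import pandas as pd

# Sample data retyped from the question.
list1 = pd.DataFrame(
    {"time": ["09:31", "10:34", "13:40"], "rates": [1.30, 2.42, 1.49]},
    index=[299, 1230, 32],
)
list2 = pd.DataFrame(
    {
        "time": ["10:31", "10:59", "11:19", "13:54", "14:13"],
        "Symbol": ["aqb", "cdd", "oty", "zqb", "omu"],
        "IV": [7, 3, 4, 8, 1],
    },
    index=[78, 121, 3240, 393, 44],
)


def attach_latest_rates(left, right):
    """Hypothetical helper: give each row of `left` the most recent
    `rates` value from `right` at or before the row's own time."""
    # Parse into a temporary key column so the 'time' strings stay untouched.
    lkey = left.assign(_t=pd.to_datetime(left["time"], format="%H:%M"))
    rkey = right.assign(_t=pd.to_datetime(right["time"], format="%H:%M"))
    rkey = rkey[["_t", "rates"]].sort_values("_t")
    # merge_asof discards the index, so carry it along as a column
    # (named "index" because the original index is unnamed).
    lkey = lkey.reset_index().sort_values("_t")
    merged = pd.merge_asof(lkey, rkey, on="_t", direction="backward")
    # Restore the original labels and drop the temporary key.
    return merged.set_index("index").rename_axis(None).drop(columns="_t")


result = attach_latest_rates(list2, list1)
print(result)
```

With direction='backward' and the default allow_exact_matches=True, a list1 row with exactly the same time as a list2 row counts as a match, which fits the "no later than" wording in the question. Rows in list2 that come before the first list1 time get NaN.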

I simply merged the two dataframes on ['time'] with an indicator, then sorted the new dataframe on ['time']:

list2 = list2.merge(list1, how='outer', on=['time'], indicator=True)
list2 = list2.sort_values(['time'])

The rows marked 'left_only' (those from list2) have NaN in ['rates'], so I forward-filled them with the latest value from the preceding 'right_only' rows (those from list1):

list2 = list2.ffill()

Then I dropped the rows that came from list1 with:

list2 = list2.loc[list2['_merge'] != 'right_only']
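
Put together, the three steps above might look like the following sketch on the question's sample data. A few things here go beyond the answer and are assumptions: only the rates column is forward-filled (so Symbol and IV are never copied between rows), list2's index is carried through a temporary "index" column, and the integer dtypes that the outer merge turns into floats are cast back. It also assumes the times are zero-padded HH:MM strings, so sorting them as text gives chronological order.

```python
import pandas as pd

# Sample data retyped from the question.
list1 = pd.DataFrame(
    {"time": ["09:31", "10:34", "13:40"], "rates": [1.30, 2.42, 1.49]},
    index=[299, 1230, 32],
)
list2 = pd.DataFrame(
    {
        "time": ["10:31", "10:59", "11:19", "13:54", "14:13"],
        "Symbol": ["aqb", "cdd", "oty", "zqb", "omu"],
        "IV": [7, 3, 4, 8, 1],
    },
    index=[78, 121, 3240, 393, 44],
)

# Outer merge keeps every row of both frames; indicator=True adds a
# '_merge' column saying where each row came from.
merged = list2.reset_index().merge(list1, how="outer", on="time", indicator=True)

# Interleave the two sources chronologically (zero-padded strings sort correctly).
merged = merged.sort_values("time", kind="stable")

# Carry the most recent rate forward into the list2 rows that follow it.
merged["rates"] = merged["rates"].ffill()

# Drop the rows that exist only in list1, then restore labels and dtypes.
result = (
    merged.loc[merged["_merge"] != "right_only"]
    .drop(columns="_merge")
    .astype({"index": "int64", "IV": "int64"})
    .set_index("index")
    .rename_axis(None)
)
print(result)
```

When a list1 row and a list2 row share the same time, the outer merge puts them in a single row marked 'both', so that list2 row gets the rate from exactly its own time.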
