Python Pandas: compare two DataFrames on one column and return the row contents of both DataFrames in another DataFrame

Question

I am working with two CSV files, imported as the DataFrames df1 and df2.
df1 has 50000 rows and df2 has 150000 rows.
I want to compare the 'time' of df2 with that of df1 (iterating through each row), find the difference in time, return the values of all columns for the closest-matching rows, and save them in df3 (time synchronization).
For example, 35427949712 (in 'time' of df1) is nearest or equal to 35427949712 (in 'time' of df2), so I would like to return the contents of df1 ('velocity_x' and 'yaw') and df2 ('velocity' and 'yawrate') and save them in df3.
For this I used two techniques, shown in the code.
Code 1 takes a very long time to execute (72 hours), which is not practical since I have a lot of CSV files.
Code 2 gives me a "memory error" and the kernel dies.

It would be great to get a more robust solution to the problem, considering computational time, memory and power (Intel Core i7-6700HQ, 8 GB RAM).

Here is the sample data:

import pandas as pd
df1 = pd.DataFrame({'time': [35427889701, 35427909854, 35427929709,35427949712, 35428009860], 
                    'velocity_x':[12.5451, 12.5401,12.5351,12.5401,12.5251],
                   'yaw' : [-0.0787806, -0.0784749, -0.0794889,-0.0795915,-0.0795472]})

df2 = pd.DataFrame({'time': [35427929709, 35427949712, 35427009860,35427029728, 35427049705], 
                    'velocity':[12.6583, 12.6556,12.6556,12.6556,12.6444],
                    'yawrate' : [-0.0750492, -0.0750492, -0.074351,-0.074351,-0.074351]})

df3 = pd.DataFrame(columns=['time','velocity_x','yaw','velocity','yawrate'])

Code 1

for index, row in df1.iterrows():
    min_delta = 100000  # largest time difference accepted as a match
    for indexer, rows in df2.iterrows():
        if abs(float(row['time']) - float(rows['time'])) < min_delta:
            min_delta = abs(float(row['time']) - float(rows['time']))
            # storing the position of the closest df2 row found so far
            pos = indexer
    # df1 values come from the current row, df2 values from the closest match
    df3.loc[index, 'time'] = row['time']
    df3.loc[index, 'velocity_x'] = row['velocity_x']
    df3.loc[index, 'yaw'] = row['yaw']
    df3.loc[index, 'velocity'] = df2['velocity'][pos]
    df3.loc[index, 'yawrate'] = df2['yawrate'][pos]

Code 2

df1['key'] = 1
df2['key'] = 1
df1.rename(index=str, columns ={'time' : 'time_x'}, inplace=True)

df = df2.merge(df1, on='key', how ='left').reset_index()
df['diff'] = df.apply(lambda x: abs(x['time']  - x['time_x']), axis=1)
df.sort_values(by=['time', 'diff'], inplace=True)

df=df.groupby(['time']).first().reset_index()[['time', 'velocity_x', 'yaw', 'velocity', 'yawrate']]

Answer 1

You're looking for pandas.merge_asof. It allows you to combine two DataFrames on a key, in this case time, without requiring the keys to match exactly. You can choose a direction for prioritizing the match, but in this case it's obvious that you want nearest:

A “nearest” search selects the row in the right DataFrame whose 'on' key is closest in absolute distance to the left's key.

One caveat is that both DataFrames need to be sorted by the key for merge_asof to work.

import pandas as pd

pd.merge_asof(df2.sort_values('time'), df1.sort_values('time'), on='time', direction='nearest')
#          time  velocity   yawrate  velocity_x       yaw
#0  35427009860   12.6556 -0.074351     12.5451 -0.078781
#1  35427029728   12.6556 -0.074351     12.5451 -0.078781
#2  35427049705   12.6444 -0.074351     12.5451 -0.078781
#3  35427929709   12.6583 -0.075049     12.5351 -0.079489
#4  35427949712   12.6556 -0.075049     12.5401 -0.079591

Just be careful about which DataFrame you choose as the left or right frame, as that changes the result. In this case I'm selecting the time in df1 which is closest in absolute distance to the time in df2.
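To make the left/right choice concrete, here is a minimal sketch using the sample data from the question, this time with df1 as the left frame so that the result has one row per df1 row. The column names time_df2 and time_diff and the tolerance value of 50000 are assumptions chosen for illustration; they are not part of the original answer.

```python
import pandas as pd

df1 = pd.DataFrame({'time': [35427889701, 35427909854, 35427929709, 35427949712, 35428009860],
                    'velocity_x': [12.5451, 12.5401, 12.5351, 12.5401, 12.5251],
                    'yaw': [-0.0787806, -0.0784749, -0.0794889, -0.0795915, -0.0795472]})
df2 = pd.DataFrame({'time': [35427929709, 35427949712, 35427009860, 35427029728, 35427049705],
                    'velocity': [12.6583, 12.6556, 12.6556, 12.6556, 12.6444],
                    'yawrate': [-0.0750492, -0.0750492, -0.074351, -0.074351, -0.074351]})

# Copy the right frame's key into a separate column (hypothetical name) so the
# matched df2 timestamp is still visible after the merge.
right = df2.sort_values('time').assign(time_df2=lambda d: d['time'])

# df1 is the left frame: each df1 row receives its nearest df2 row.
df3 = pd.merge_asof(df1.sort_values('time'), right, on='time', direction='nearest')
df3['time_diff'] = (df3['time'] - df3['time_df2']).abs()

# Optional: reject matches that are too far apart. Rows without a match
# inside the tolerance get NaN in the df2 columns.
df3_tol = pd.merge_asof(df1.sort_values('time'), right, on='time',
                        direction='nearest', tolerance=50000)
```

Inspecting time_diff is a quick way to judge whether a tolerance is needed at all for a given pair of files.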

You also need to be careful if the right DataFrame has duplicated 'on' keys, because for exact matches merge_asof only merges the last sorted row of the right DataFrame into the left one, instead of creating multiple entries for each exact match. If that's a problem, you can instead merge the exact keys first to get all of the combinations, and then merge the remainder with merge_asof.
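The two-step approach for duplicated keys could look roughly like the sketch below. The toy frames left and right and their columns a and b are hypothetical and only serve to show the mechanics; the sketch assumes that an ordinary inner merge is an acceptable way to handle the exact matches.

```python
import pandas as pd

# Hypothetical toy data: the right frame contains the key 10 twice.
left = pd.DataFrame({'time': [10, 22, 31], 'a': [1, 2, 3]})
right = pd.DataFrame({'time': [10, 10, 30], 'b': ['x', 'y', 'z']})

# A plain merge_asof returns exactly one row per left row, so only one of
# the two right rows with time == 10 can appear in the result.
plain = pd.merge_asof(left, right, on='time', direction='nearest')

# Step 1: an ordinary inner merge keeps every combination of exact matches.
exact = left.merge(right, on='time', how='inner')

# Step 2: only the left rows without an exact match go through merge_asof.
remainder = left[~left['time'].isin(right['time'])]
approx = pd.merge_asof(remainder.sort_values('time'), right.sort_values('time'),
                       on='time', direction='nearest')

# Combine both parts; a stable sort keeps the order of the duplicated rows.
result = (pd.concat([exact, approx])
            .sort_values('time', kind='mergesort')
            .reset_index(drop=True))
```

Note that a row in the remainder can still have a duplicated right key as its nearest neighbour, in which case merge_asof again picks a single row for it.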

Answer 2

Just a side note (not an answer):

    min_delta=100000
    for indexer, rows in df2.iterrows():
        if abs(float(row['time'])-float(rows['time']))<min_delta:
            min_delta = abs(float(row['time'])-float(rows['time']))
            #storing the position
            pos = indexer

can be written as

    # absolute time difference to every df2 row at once
    diff = (row['time'] - df2['time']).abs()
    # index label of the closest df2 row
    pos = diff.idxmin()

(avoid explicit for loops over rows whenever a vectorized operation is available)

Also, don't name your variables after a built-in (min).
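For completeness, here is a runnable sketch of the question's outer loop with the inner loop replaced by the vectorized lookup from this side note. It uses the sample data from the question; using itertuples (so that time stays an integer) and collecting the rows in a list are choices made for this sketch, not something the answer prescribes. It still does one pass over df2 per df1 row, so it is only a stepping stone towards a fully vectorized solution.

```python
import pandas as pd

df1 = pd.DataFrame({'time': [35427889701, 35427909854, 35427929709, 35427949712, 35428009860],
                    'velocity_x': [12.5451, 12.5401, 12.5351, 12.5401, 12.5251],
                    'yaw': [-0.0787806, -0.0784749, -0.0794889, -0.0795915, -0.0795472]})
df2 = pd.DataFrame({'time': [35427929709, 35427949712, 35427009860, 35427029728, 35427049705],
                    'velocity': [12.6583, 12.6556, 12.6556, 12.6556, 12.6444],
                    'yawrate': [-0.0750492, -0.0750492, -0.074351, -0.074351, -0.074351]})

matched = []
for r in df1.itertuples(index=False):
    # Vectorized inner step: distance from this df1 time to every df2 time.
    diff = (df2['time'] - r.time).abs()
    pos = diff.idxmin()  # index label of the closest df2 row
    matched.append({'time': r.time,
                    'velocity_x': r.velocity_x,
                    'yaw': r.yaw,
                    'velocity': df2.loc[pos, 'velocity'],
                    'yawrate': df2.loc[pos, 'yawrate']})

# Build the result once at the end instead of growing df3 row by row.
df3 = pd.DataFrame(matched)
```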


Answer 1
5, accepted, 2018-05-20 15:29:52

Answer 2
3, 2018-05-20 16:00:24
