Python Pandas: compare two DataFrames on one column and return the row contents of both in another DataFrame

Question

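I am working with two CSV files, imported as DataFrames df1 and df2.
df1 has 50,000 rows and df2 has 150,000 rows.
I want to compare (iterating through each row) the 'time' of df2 with that of df1, find the difference in time, and return the values of all columns for the matching rows, saving them in df3 (time synchronization).
For example, 35427949712 ('time' in df1) is nearest or equal to 35427949712 ('time' in df2), so I would like to return the contents of df1 ('velocity_x' and 'yaw') and df2 ('velocity' and 'yawrate') and save them in df3.
To do this I tried two techniques, shown in the code below.
Code 1 takes a very long time to execute, about 72 hours, which is not practical because I have a lot of CSV files.
Code 2 gives me a "memory error" and the kernel dies.

It would be great to get a more robust solution that takes computation time, memory, and power into account (Intel Core i7-6700HQ, 8 GB RAM).

Here is the sample data: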

import pandas as pd
df1 = pd.DataFrame({'time': [35427889701, 35427909854, 35427929709,35427949712, 35428009860], 
                    'velocity_x':[12.5451, 12.5401,12.5351,12.5401,12.5251],
                   'yaw' : [-0.0787806, -0.0784749, -0.0794889,-0.0795915,-0.0795472]})

df2 = pd.DataFrame({'time': [35427929709, 35427949712, 35427009860,35427029728, 35427049705], 
                    'velocity':[12.6583, 12.6556,12.6556,12.6556,12.6444],
                    'yawrate' : [-0.0750492, -0.0750492, -0.074351,-0.074351,-0.074351]})

df3 = pd.DataFrame(columns=['time','velocity_x','yaw','velocity','yawrate'])

Code 1

for index, row in df1.iterrows():
    min=100000
    for indexer, rows in df2.iterrows():
        if abs(float(row['time'])-float(rows['time']))<min:
            min = abs(float(row['time'])-float(rows['time']))
            #storing the position 
            pos = indexer
    # pos is a df2 index, so the df1 values are read from the current df1 row
    df3.loc[index,'time'] = df1.loc[index,'time']
    df3.loc[index,'velocity_x'] = df1.loc[index,'velocity_x']
    df3.loc[index,'yaw'] = df1.loc[index,'yaw']
    df3.loc[index,'velocity'] = df2['velocity'][pos]
    df3.loc[index,'yawrate'] = df2['yawrate'][pos]

Code 2

df1['key'] = 1
df2['key'] = 1
df1.rename(index=str, columns ={'time' : 'time_x'}, inplace=True)

df = df2.merge(df1, on='key', how ='left').reset_index()
df['diff'] = df.apply(lambda x: abs(x['time']  - x['time_x']), axis=1)
df.sort_values(by=['time', 'diff'], inplace=True)

df=df.groupby(['time']).first().reset_index()[['time', 'velocity_x', 'yaw', 'velocity', 'yawrate']]

Answer 1

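You are looking for pandas.merge_asof. It lets you combine two DataFrames on a key, in this case time, without requiring an exact match. You can choose a direction to prioritize the match, but in this case it is clear that you want nearest.

A "nearest" search selects the row in the right DataFrame whose "on" key is closest in absolute distance to the left's key.

One caveat is that both DataFrames need to be sorted by the key for merge_asof to work properly.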

import pandas as pd

pd.merge_asof(df2.sort_values('time'), df1.sort_values('time'), on='time', direction='nearest')
#          time  velocity   yawrate  velocity_x       yaw
#0  35427009860   12.6556 -0.074351     12.5451 -0.078781
#1  35427029728   12.6556 -0.074351     12.5451 -0.078781
#2  35427049705   12.6444 -0.074351     12.5451 -0.078781
#3  35427929709   12.6583 -0.075049     12.5351 -0.079489
#4  35427949712   12.6556 -0.075049     12.5401 -0.079591

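Be careful about which DataFrame you choose as the left or right frame, because that changes the result. In this case I am selecting the time in df1 that is closest in absolute distance to the time in df2.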
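Since the question asks for one row per df1 time plus the time difference, here is a minimal sketch with the frames swapped (df1 on the left). The column names time_df2 and time_diff are made up for illustration, and the tolerance value of 50000 is an arbitrary example, not something taken from the question.

```python
import pandas as pd

df1 = pd.DataFrame({'time': [35427889701, 35427909854, 35427929709, 35427949712, 35428009860],
                    'velocity_x': [12.5451, 12.5401, 12.5351, 12.5401, 12.5251],
                    'yaw': [-0.0787806, -0.0784749, -0.0794889, -0.0795915, -0.0795472]})
df2 = pd.DataFrame({'time': [35427929709, 35427949712, 35427009860, 35427029728, 35427049705],
                    'velocity': [12.6583, 12.6556, 12.6556, 12.6556, 12.6444],
                    'yawrate': [-0.0750492, -0.0750492, -0.074351, -0.074351, -0.074351]})

# Both frames must be sorted by their key. Renaming the right key keeps
# both timestamps in the result ('time_df2' is an illustrative name).
left = df1.sort_values('time')
right = df2.sort_values('time').rename(columns={'time': 'time_df2'})

# One output row per df1 row, matched to the nearest df2 time.
df3 = pd.merge_asof(left, right, left_on='time', right_on='time_df2',
                    direction='nearest')
df3['time_diff'] = (df3['time'] - df3['time_df2']).abs()

# Optional: reject matches that are too far apart. Rows without a match
# within the tolerance get NaN in the df2 columns (50000 is an example value).
df3_tol = pd.merge_asof(left, right, left_on='time', right_on='time_df2',
                        direction='nearest', tolerance=50000)
```

With left_on and right_on both key columns survive the merge, so the difference can be computed afterwards without a second lookup.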

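You also need to be careful if the right df has duplicated on keys, because for exact matches merge_asof only merges the last sorted row of the right df to the left df, instead of creating multiple entries for each exact match. If that is a problem, you can merge the exact keys first to get all of the combinations, and then merge the remainder with asof.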
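A rough sketch of that two-step approach, using small made-up frames (left, right, and the val column are illustrative, not the question's data):

```python
import pandas as pd

# Toy data: the right frame has a duplicated key (time == 10).
left = pd.DataFrame({'time': [10, 22, 31]})
right = pd.DataFrame({'time': [10, 10, 30], 'val': ['a', 'b', 'c']})

# merge_asof on its own returns one row per left row, so only one of the
# two right rows with time == 10 can appear in the output.
asof_only = pd.merge_asof(left, right, on='time', direction='nearest')

# Step 1: an ordinary inner merge yields every exact-match combination.
exact = left.merge(right, on='time', how='inner')

# Step 2: merge_asof only for the left rows that had no exact match.
remainder = left[~left['time'].isin(right['time'])]
near = pd.merge_asof(remainder.sort_values('time'), right.sort_values('time'),
                     on='time', direction='nearest')

# Combine the two parts and restore the time order.
result = (pd.concat([exact, near], ignore_index=True)
            .sort_values('time', kind='stable')
            .reset_index(drop=True))
```

Here asof_only has three rows while result has four, because both right rows with time 10 are kept.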

Answer 2

Just a side note (not an answer)

    min_delta=100000
    for indexer, rows in df2.iterrows():
        if abs(float(row['time'])-float(rows['time']))<min_delta:
            min_delta = abs(float(row['time'])-float(rows['time']))
            #storing the position
            pos = indexer

can be written as

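    # requires: import numpy as np
    diff = np.abs(row['time'] - df2['time'])
    # idxmin returns the index label of the smallest difference, like indexer above
    pos = diff.idxmin()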

(always avoid loops)

And do not name your variables after built-ins (min).
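For context, this is roughly how the side note could look inside the full outer loop from the question. It is only a sketch: collecting the rows in a list and using itertuples are my own choices, not something from the question, and it still performs one pass over df2 per df1 row, so merge_asof is likely the better fit for 50,000 by 150,000 rows.

```python
import pandas as pd

df1 = pd.DataFrame({'time': [35427889701, 35427909854, 35427929709, 35427949712, 35428009860],
                    'velocity_x': [12.5451, 12.5401, 12.5351, 12.5401, 12.5251],
                    'yaw': [-0.0787806, -0.0784749, -0.0794889, -0.0795915, -0.0795472]})
df2 = pd.DataFrame({'time': [35427929709, 35427949712, 35427009860, 35427029728, 35427049705],
                    'velocity': [12.6583, 12.6556, 12.6556, 12.6556, 12.6444],
                    'yawrate': [-0.0750492, -0.0750492, -0.074351, -0.074351, -0.074351]})

rows = []
for r in df1.itertuples(index=False):
    # Vectorized inner step: distance from this df1 time to every df2 time.
    diff = (df2['time'] - r.time).abs()
    pos = diff.idxmin()  # index label of the closest df2 row
    rows.append({'time': r.time,
                 'velocity_x': r.velocity_x,
                 'yaw': r.yaw,
                 'velocity': df2.loc[pos, 'velocity'],
                 'yawrate': df2.loc[pos, 'yawrate']})

# Build the result once instead of assigning into df3 cell by cell.
df3 = pd.DataFrame(rows)
```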
