Pandas merge vs. boolean indexing

Question

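I am using pandas in Python 3.4 to identify matches between two data frames. Matching is based on strict equality, except for the last column, where a close match (+/- 5) is good enough.

In this case, one data frame contains many rows and the second is just a single row. As mentioned, the desired result is a data frame containing the subset of the first data frame that matches that row.

I first used a specific solution based on boolean indexing, but it took a while to run over all the data, so I tried the pandas merge function. However, my merge implementation is even slower on my test data: it runs 2 to 4 times slower than the boolean indexing.

Here is a test run: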

import pandas as pd
import random
import time

def make_lsts(lst, num, num_choices):
    # Append `num` random integers drawn from range(num_choices) to `lst`.
    choices = list(range(num_choices))
    lst.extend(random.choice(choices) for _ in range(num))
    return lst

def old_way(test, data):
    t1 = time.time()
    tmp = data[(data.col_1 == test.col_1[0]) &
              (data.col_2 == test.col_2[0]) &
              (data.col_3 == test.col_3[0]) &
              (data.col_4 == test.col_4[0]) &
              (data.col_5 == test.col_5[0]) &
              (data.col_6 == test.col_6[0]) &
              (data.col_7 == test.col_7[0]) &
              (data.col_8 >= (test.col_8[0]-5)) &
              (data.col_8 <= (test.col_8[0]+5))]
    t2 = time.time()
    print('old time:', t2-t1)

def new_way(test, data):
    t1 = time.time()
    tmp = pd.merge(test, data, how='inner', sort=False, copy=False,
                   on=['col_1', 'col_2', 'col_3', 'col_4', 'col_5', 'col_6', 'col_7'])
    tmp = tmp[(tmp.col_8_y >= (test.col_8[0] - 5)) & (tmp.col_8_y <= (test.col_8[0] + 5))]
    t2 = time.time()
    print('new time:', t2-t1)

if __name__ == '__main__':
    t1 = time.time()
    data = pd.DataFrame({'col_1':make_lsts([], 4000000, 7),
                         'col_2':make_lsts([], 4000000, 3),
                         'col_3':make_lsts([], 4000000, 3),
                         'col_4':make_lsts([], 4000000, 5),
                         'col_5':make_lsts([], 4000000, 4),
                         'col_6':make_lsts([], 4000000, 4),
                         'col_7':make_lsts([], 4000000, 2),
                         'col_8':make_lsts([], 4000000, 20)})

    test = pd.DataFrame({'col_1':[1], 'col_2':[1], 'col_3':[1], 'col_4':[4], 'col_5':[0], 'col_6':[1], 'col_7':[0], 'col_8':[12]})
    t2 = time.time()
    old_way(test, data)
    new_way(test, data)
    print('time building data:', t2-t1)

In my most recent run, I see the following:

 # old time: 0.2209608554840088
 # new time: 0.9070699214935303
 # time building data: 75.05818915367126

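Note that even the new method with merge still uses boolean indexing on the last column to handle the range of values, but I figured merge might be able to solve the rest of the problem faster. That is clearly not the case, since the merge on the first columns takes up nearly all of the time used by the new method.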
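One way to check where the time goes is to time the merge and the range filter separately. The following is a minimal sketch, not the original benchmark: it assumes NumPy for generating the data, uses a much smaller row count than the 4,000,000 above, and the helper names `make_data` and `time_merge_stages` are hypothetical.

```python
import time

import numpy as np
import pandas as pd

KEYS = ['col_%d' % i for i in range(1, 8)]


def make_data(n_rows, seed=0):
    """Build a smaller stand-in for the test frame (row count is an assumption)."""
    rng = np.random.RandomState(seed)
    n_choices = [7, 3, 3, 5, 4, 4, 2, 20]  # same value ranges as the test above
    return pd.DataFrame({'col_%d' % (i + 1): rng.randint(0, c, size=n_rows)
                         for i, c in enumerate(n_choices)})


def time_merge_stages(test, data):
    """Return (merge_seconds, filter_seconds, result) for the merge approach."""
    t0 = time.perf_counter()
    merged = pd.merge(test, data, how='inner', on=KEYS, sort=False)
    t1 = time.perf_counter()
    target = test['col_8'].iloc[0]
    # col_8 exists in both frames, so the merge suffixes it with _x / _y.
    result = merged[merged['col_8_y'].between(target - 5, target + 5)]
    t2 = time.perf_counter()
    return t1 - t0, t2 - t1, result


data = make_data(200000)
test = pd.DataFrame({'col_1': [1], 'col_2': [1], 'col_3': [1], 'col_4': [4],
                     'col_5': [0], 'col_6': [1], 'col_7': [0], 'col_8': [12]})
merge_s, filter_s, result = time_merge_stages(test, data)
print('merge: %.4f s, range filter: %.4f s, rows: %d'
      % (merge_s, filter_s, len(result)))
```

The absolute numbers will depend on the machine and the pandas version; only the ratio between the two stages is of interest here.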

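Is it possible to optimize my implementation of merge? (Coming from R and data.table, I spent 30 minutes trying unsuccessfully to find a way to set a key on a pandas data frame.) Is this just a problem that merge does not handle well? Why is boolean indexing faster than merge in this example?

I do not fully understand the memory backends of these approaches, so any insight is appreciated.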

Answer 1

Although you can merge on any set of columns, merge performance is best when merging on indexes.

If you replace

tmp = pd.merge(test, data, how='inner', sort=False, copy=False,
               on=['col_1', 'col_2', 'col_3', 'col_4', 'col_5', 'col_6', 'col_7'])

with

cols = ['col_%i' % (i+1) for i in range(7)]  # range, not xrange: the question uses Python 3
test.set_index(cols, inplace=True)
data.set_index(cols, inplace=True)
tmp = pd.merge(test, data, how='inner', left_index=True, right_index=True)
test.reset_index(inplace=True)
data.reset_index(inplace=True)

does it run any faster? I haven't tested it, but I think it should help...
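Before timing the suggestion on the full data, it may be worth confirming on a small frame that the index-based merge returns the same rows as the boolean mask. A minimal sketch, assuming NumPy-generated toy data with binary key columns; `match_by_mask` and `match_by_index_merge` are hypothetical helper names, and `set_index` is called without `inplace` so the inputs are left untouched:

```python
import numpy as np
import pandas as pd

KEYS = ['col_%d' % i for i in range(1, 8)]


def match_by_mask(test, data):
    """Boolean-indexing version: exact match on KEYS, +/- 5 on col_8."""
    row = test.iloc[0]
    mask = (data[KEYS] == row[KEYS]).all(axis=1)
    mask &= data['col_8'].between(row['col_8'] - 5, row['col_8'] + 5)
    return data[mask]


def match_by_index_merge(test, data):
    """Index-merge version, following the snippet above."""
    left = test.set_index(KEYS)
    right = data.set_index(KEYS)
    merged = pd.merge(left, right, how='inner',
                      left_index=True, right_index=True)
    target = test['col_8'].iloc[0]
    # col_8_x comes from `test`, col_8_y from `data`.
    merged = merged[merged['col_8_y'].between(target - 5, target + 5)]
    return merged.reset_index()


# Toy data (an assumption): binary key columns so that some rows match.
rng = np.random.RandomState(0)
data = pd.DataFrame({k: rng.randint(0, 2, size=5000) for k in KEYS})
data['col_8'] = rng.randint(0, 20, size=5000)
test = pd.DataFrame({k: [1] for k in KEYS})
test['col_8'] = [12]

by_mask = match_by_mask(test, data)
by_merge = match_by_index_merge(test, data)
print(len(by_mask), len(by_merge))
```

Keep in mind that `set_index` on the large frame is work in its own right. If it is repeated inside the timed function for every lookup, it may well outweigh whatever the merge gains, so the index is probably best built once and reused.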

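By indexing the columns you want to merge on, the DataFrame organizes the data under the hood so that it knows where to look much faster than when the data sits in ordinary columns.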
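For readers coming from data.table, the closest pandas analogue to setting a key is probably `set_index` followed by `sort_index`, after which rows can be looked up by key with `.loc`. The sketch below illustrates that idea under stated assumptions: the toy data is generated with NumPy, `lookup` is a hypothetical helper, and whether this beats the boolean mask for a single lookup has not been measured here.

```python
import numpy as np
import pandas as pd

KEYS = ['col_%d' % i for i in range(1, 8)]

# Toy data (an assumption): binary key columns so that some rows match.
rng = np.random.RandomState(1)
data = pd.DataFrame({k: rng.randint(0, 2, size=5000) for k in KEYS})
data['col_8'] = rng.randint(0, 20, size=5000)

# One-off cost: move the key columns into a sorted MultiIndex.
keyed = data.set_index(KEYS).sort_index()


def lookup(keyed, key, target, tol=5):
    """Rows whose index equals `key` and whose col_8 is within `tol` of `target`."""
    if key not in keyed.index:
        return keyed.iloc[0:0]  # empty frame with the same columns
    block = keyed.loc[[key]]    # list of one tuple: always returns a DataFrame
    return block[block['col_8'].between(target - tol, target + tol)]


hits = lookup(keyed, (1,) * 7, 12)
print(len(hits))
```

The index is built once and can then serve many lookups, which is where this layout is most likely to pay off; for a one-time query the cost of sorting may dominate.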

Solution 1
1 Accepted 2016-01-26 02:25:23
