Pandas Merge vs. Boolean Indexing

I'm using pandas in Python 3.4 to identify matches between two data frames. Matches are based on strict equality except for the last column, where close matches (+/- 5) are fine.

One data frame contains many rows, and the second is just a single row in this case. The desired result is a data frame containing the subset of the first data frame that matches the row, as mentioned.

I went with the straightforward solution of boolean indexing first, but it took a while to chug through all of the data, so I tried out the pandas merge function. However, my merge implementation is even slower on my test data, running between 2 and 4 times slower than the boolean indexing.

Here is a test run:

import pandas as pd
import random
import time

def make_lsts(lst, num, num_choices):
    # Append `num` random draws from range(num_choices) to `lst` and return it.
    choices = list(range(num_choices))
    lst.extend(random.choice(choices) for _ in range(num))
    return lst

def old_way(test, data):
    # Boolean indexing: strict equality on col_1..col_7, +/- 5 window on col_8.
    t1 = time.time()
    tmp = data[(data.col_1 == test.col_1[0]) &
               (data.col_2 == test.col_2[0]) &
               (data.col_3 == test.col_3[0]) &
               (data.col_4 == test.col_4[0]) &
               (data.col_5 == test.col_5[0]) &
               (data.col_6 == test.col_6[0]) &
               (data.col_7 == test.col_7[0]) &
               (data.col_8 >= (test.col_8[0] - 5)) &
               (data.col_8 <= (test.col_8[0] + 5))]
    t2 = time.time()
    print('old time:', t2 - t1)

def new_way(test, data):
    # Merge on the seven exact-match columns, then filter data's col_8
    # (suffixed to col_8_y by the merge) to the +/- 5 window.
    t1 = time.time()
    tmp = pd.merge(test, data, how='inner', sort=False, copy=False,
                   on=['col_1', 'col_2', 'col_3', 'col_4', 'col_5', 'col_6', 'col_7'])
    tmp = tmp[(tmp.col_8_y >= (test.col_8[0] - 5)) & (tmp.col_8_y <= (test.col_8[0] + 5))]
    t2 = time.time()
    print('new time:', t2 - t1)

if __name__ == '__main__':
    t1 = time.time()
    data = pd.DataFrame({'col_1':make_lsts([], 4000000, 7),
                         'col_2':make_lsts([], 4000000, 3),
                         'col_3':make_lsts([], 4000000, 3),
                         'col_4':make_lsts([], 4000000, 5),
                         'col_5':make_lsts([], 4000000, 4),
                         'col_6':make_lsts([], 4000000, 4),
                         'col_7':make_lsts([], 4000000, 2),
                         'col_8':make_lsts([], 4000000, 20)})

    test = pd.DataFrame({'col_1':[1], 'col_2':[1], 'col_3':[1], 'col_4':[4],
                         'col_5':[0], 'col_6':[1], 'col_7':[0], 'col_8':[12]})
    t2 = time.time()
    old_way(test, data)
    new_way(test, data)
    print('time building data:', t2-t1)

On my most recent run I see the following:

 # old time: 0.2209608554840088
 # new time: 0.9070699214935303
 # time building data: 75.05818915367126
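
As an aside, nearly all of those 75 seconds go into building the random data one element at a time in Python. A vectorized sketch with numpy (which ships alongside pandas), using a hypothetical make_lsts_fast in place of make_lsts above, should cut the data-building time dramatically:

import numpy as np

# Hypothetical vectorized replacement for make_lsts: draw all `num` values
# for a column in one call instead of appending in a Python-level loop.
def make_lsts_fast(num, num_choices):
    return np.random.randint(0, num_choices, size=num)

Each make_lsts([], 4000000, n) call above would then become make_lsts_fast(4000000, n).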

Note that even the new method with the merge function uses boolean indexing on the last column to handle the range of values, but I thought the merge might be able to do the heavy lifting in the problem. This is clearly not the case, since the merge on the first seven columns takes up almost all of the time used in the new method.

Is it possible to optimize my implementation of the merge function? (Coming from R and data.table, I spent 30 minutes unsuccessfully searching for a way to set the key in a pandas data frame.) Is this just a problem that merge isn't good at handling? Why is boolean indexing faster than merge in this example?

I don't fully understand the memory backend of these approaches, so any insight is appreciated.
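
For intuition, the boolean-indexing version reduces to a handful of vectorized comparisons over the columns' underlying numpy arrays, combined elementwise into one mask. A rough equivalent, assuming the data and test frames built above:

import numpy as np

# Each comparison yields a full-length boolean array; &= combines them
# elementwise into a single mask used to select the matching rows.
mask = np.ones(len(data), dtype=bool)
for col in ['col_1', 'col_2', 'col_3', 'col_4', 'col_5', 'col_6', 'col_7']:
    mask &= (data[col].values == test[col][0])
mask &= (data['col_8'].values >= test['col_8'][0] - 5)
mask &= (data['col_8'].values <= test['col_8'][0] + 5)
tmp = data[mask]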

While you can merge on any set of columns, the performance of the merge is going to be best when you are merging on indexes.

If you replace

tmp = pd.merge(test, data, how='inner', sort=False, copy=False,
               on=['col_1', 'col_2', 'col_3', 'col_4', 'col_5', 'col_6', 'col_7'])

with

cols = ['col_%i' % (i + 1) for i in range(7)]   # ['col_1', ..., 'col_7']
test.set_index(cols, inplace=True)              # move the join keys into the index
data.set_index(cols, inplace=True)
tmp = pd.merge(test, data, how='inner', left_index=True, right_index=True)
test.reset_index(inplace=True)                  # restore the original columns
data.reset_index(inplace=True)

Does that run faster? I haven't tested it, but I think that should help...

By indexing the columns you want to merge on, the DataFrame organizes the data under the hood so that it can locate values much more quickly than if the data were simply in ordinary columns.
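
If the single-row lookup is the common case, a sorted MultiIndex also lets you skip the merge entirely and pull matches out with .loc. A minimal sketch, assuming the data and test frames from the question and that the key actually occurs in data (.loc raises KeyError otherwise):

key_cols = ['col_%i' % (i + 1) for i in range(7)]

# Build once: sorting the MultiIndex lets lookups use binary search.
indexed = data.set_index(key_cols).sort_index()

# The list-of-keys form always returns a DataFrame, even for a single match.
key = tuple(test[c][0] for c in key_cols)
subset = indexed.loc[[key]]

# Apply the +/- 5 window on col_8 as before.
subset = subset[(subset.col_8 >= test.col_8[0] - 5) &
                (subset.col_8 <= test.col_8[0] + 5)]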
