Efficiently find matching rows (based on content) in a pandas DataFrame

I am writing some tests and I am using pandas DataFrames to house a large dataset (~600,000 x 10). I have extracted 10 random rows from the source data (using Stata) and now I want to write a test to see if those rows are in the DataFrame in my test suite.

As a small example:

import numpy as np
import pandas as pd

np.random.seed(2)
raw_data = pd.DataFrame(np.random.rand(5, 3), columns=['one', 'two', 'three'])
random_sample = raw_data.iloc[1]  # .ix is deprecated; .iloc selects by position

Here raw_data is:

[image: the raw_data DataFrame]

And random_sample is derived to guarantee a match, and is:

[image: the random_sample Series]

Currently I have written:

# Baseline: iterate over every row and compare with equals()
for idx, row in raw_data.iterrows():
    if random_sample.equals(row):
        print("match")
        break

This works, but on the large dataset it is very slow. Is there a more efficient way to check whether an entire row is contained in the DataFrame?

BTW: my example also needs to treat np.NaN values as equal, which is why I am using the equals() method.
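As a quick illustration of why plain == is not enough here (a minimal example of my own, not from the original post):

import numpy as np
import pandas as pd

a = pd.Series([1.0, np.nan])
b = pd.Series([1.0, np.nan])

print((a == b).all())  # False -- NaN == NaN is False under element-wise ==
print(a.equals(b))     # True -- equals() treats NaNs in the same positions as equal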

equals doesn't seem to broadcast, but we can always do the equality comparison manually:

>>> df = pd.DataFrame(np.random.rand(600000, 10))
>>> sample = df.iloc[-1]
>>> %timeit df[((df == sample) | (df.isnull() & sample.isnull())).all(1)]
1 loops, best of 3: 231 ms per loop
>>> df[((df == sample) | (df.isnull() & sample.isnull())).all(1)]
              0         1         2         3         4         5         6  \
599999  0.07832  0.064828  0.502513  0.851816  0.976464  0.761231  0.275242   

               7        8         9  
599999  0.426393  0.91632  0.569807  

which is much faster than the iterative version for me (which takes > 30 s).
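For convenience, the same comparison can be wrapped in a small helper (a sketch; the function name find_matches is my own):

def find_matches(df, sample):
    # A row matches when every column either equals the sample value
    # or both sides are NaN (which == alone would report as unequal).
    mask = ((df == sample) | (df.isnull() & sample.isnull())).all(axis=1)
    return df[mask]

so find_matches(df, sample) returns the matching row(s) as a DataFrame.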

But since we have lots of rows and relatively few columns, we could loop over the columns instead, and in the typical case probably cut down substantially on the number of rows to be looked at. For example, something like

def finder(df, row):
    # Narrow the candidate rows one column at a time;
    # NaNs are matched explicitly since NaN == NaN is False.
    for col in df:
        df = df.loc[(df[col] == row[col]) | (df[col].isnull() & pd.isnull(row[col]))]
    return df

gives me

>>> %timeit finder(df, sample)
10 loops, best of 3: 35.2 ms per loop

which is roughly an order of magnitude faster, because after the first column there's only one row left.
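A possible refinement (my own sketch, not part of the original answer) is to stop as soon as no candidates remain, since further columns cannot revive an empty set:

def finder_early_exit(df, row):
    # Hypothetical variant of finder() that bails out once the
    # candidate set is empty.
    for col in df:
        df = df.loc[(df[col] == row[col]) | (df[col].isnull() & pd.isnull(row[col]))]
        if df.empty:
            break
    return df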

(I think I once had a much slicker way to do this, but for the life of me I can't remember it now.)

The best I have come up with is a filtering approach, which seems to work quite well and avoids a lot of comparisons when the dataset is large:

tmp = raw_data
for idx, val in random_sample.items():  # iteritems() is deprecated
    try:
        if np.isnan(val):
            continue  # skip NaN columns: NaN == NaN is False anyway
    except TypeError:
        pass  # non-numeric value; fall through to the equality filter
    tmp = tmp[tmp[idx] == val]
if len(tmp) == 1:
    print("match")

Note: this is actually slower for the small example above, but on a large dataset it is ~9 times faster than the basic iteration.
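One caveat worth noting (my own observation, not from the original answer): because NaN columns are skipped, a surviving row could still differ from the sample at those NaN positions, so an exact test should finish with an equals() check on the remaining row:

# Exact confirmation after filtering: compare the single surviving
# row against the sample, treating aligned NaNs as equal.
if len(tmp) == 1 and random_sample.equals(tmp.iloc[0]):
    print("match")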
