
Python: how to search for a substring in a set the fast way?

I have a set containing ~300,000 tuples:

In [26]: sa = set(o.node for o in vrts_l2_5) 
In [27]: len(sa)
Out[27]: 289798
In [31]: random.sample(sa, 1)
Out[31]: [('835644', '4696507')]

Now I want to look up elements based on a common substring, e.g. the first 4 'digits' (in fact the elements are strings). This is my approach:

def lookup_set(x_appr, y_appr):
    return [n for n in sa if n[0].startswith(x_appr) and n[1].startswith(y_appr)]

In [36]: lookup_set('6652','46529')
Out[36]: [('665274', '4652941'), ('665266', '4652956')]

Is there a more efficient, that is, faster way to do this?

You can do it in O(log(n) + m) time, where n is the number of tuples and m is the number of matching tuples, if you can afford to keep two sorted copies of the tuples. Sorting itself will cost O(n log(n)), i.e. it will be asymptotically slower than your naive approach, but if you have to do a certain number of queries (more than log(n), which is almost certainly quite small) it will pay off.

The idea is that you can use bisection to find the candidates that have the correct first value and the correct second value, and then intersect these sets.

However, note that you want a strange kind of comparison: you care about all strings starting with the given argument. This simply means that when searching for the right-most occurrence you should pad the key with '9's.
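This padding trick can be demonstrated in isolation with the stdlib bisect module (a minimal sketch with made-up data, assuming all elements are decimal strings so that '9' is the largest possible character):

```python
import bisect

# Sorted list of digit strings; we want every element starting with "12".
data = sorted(["11", "120", "125", "1299", "13"])
prefix = "12"
max_length = max(len(s) for s in data)

# Left bound: the bare prefix sorts before any string that starts with it.
lo = bisect.bisect_left(data, prefix)
# Right bound: padding with '9's yields a key that sorts after every
# decimal string sharing the prefix.
hi = bisect.bisect_right(data, prefix.ljust(max_length, "9"))

print(data[lo:hi])  # -> ['120', '125', '1299']
```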

Complete working (although not tested very much) code:

from random import randint
from operator import itemgetter

first = itemgetter(0)
second = itemgetter(1)

sa = [(str(randint(0, 1000000)), str(randint(0, 1000000))) for _ in range(300000)]
f_sorted = sorted(sa, key=first)
s_sorted = sa
s_sorted.sort(key=second)
max_length = max(len(s) for pair in sa for s in pair)  # longest string in either position

# See: bisect module from stdlib
def bisect_right(seq, element, key):
    lo = 0
    hi = len(seq)
    element = element.ljust(max_length, '9')
    while lo < hi:
        mid = (lo+hi)//2
        if element < key(seq[mid]):
            hi = mid
        else:
            lo = mid + 1
    return lo


def bisect_left(seq, element, key):
    lo = 0
    hi = len(seq)
    while lo < hi:
        mid = (lo+hi)//2
        if key(seq[mid]) < element:
            lo = mid + 1
        else:
            hi = mid
    return lo


def lookup_set(x_appr, y_appr):
    x_left = bisect_left(f_sorted, x_appr, key=first)
    x_right = bisect_right(f_sorted, x_appr, key=first)
    x_candidates = f_sorted[x_left:x_right]  # bisect_right already points past the last match
    y_left = bisect_left(s_sorted, y_appr, key=second)
    y_right = bisect_right(s_sorted, y_appr, key=second)
    y_candidates = s_sorted[y_left:y_right]
    return set(x_candidates).intersection(y_candidates)

And the comparison with your initial solution:

In [2]: def lookup_set2(x_appr, y_appr):
   ...:     return [n for n in sa if n[0].startswith(x_appr) and n[1].startswith(y_appr)]

In [3]: lookup_set('123', '124')
Out[3]: set([])

In [4]: lookup_set2('123', '124')
Out[4]: []

In [5]: lookup_set('123', '125')
Out[5]: set([])

In [6]: lookup_set2('123', '125')
Out[6]: []

In [7]: lookup_set('12', '125')
Out[7]: set([('12478', '125908'), ('124625', '125184'), ('125494', '125940')])

In [8]: lookup_set2('12', '125')
Out[8]: [('124625', '125184'), ('12478', '125908'), ('125494', '125940')]

In [9]: %timeit lookup_set('12', '125')
1000 loops, best of 3: 589 us per loop

In [10]: %timeit lookup_set2('12', '125')
10 loops, best of 3: 145 ms per loop

In [11]: %timeit lookup_set('123', '125')
10000 loops, best of 3: 102 us per loop

In [12]: %timeit lookup_set2('123', '125')
10 loops, best of 3: 144 ms per loop

As you can see, this solution is about 240-1400 times faster (in these examples) than your naive approach.

If you have a big set of matches:

In [19]: %timeit lookup_set('1', '2')
10 loops, best of 3: 27.1 ms per loop

In [20]: %timeit lookup_set2('1', '2')
10 loops, best of 3: 152 ms per loop

In [21]: len(lookup_set('1', '2'))
Out[21]: 3587
In [23]: %timeit lookup_set('', '2')
10 loops, best of 3: 182 ms per loop

In [24]: %timeit lookup_set2('', '2')
1 loops, best of 3: 212 ms per loop

In [25]: len(lookup_set2('', '2'))
Out[25]: 33053

As you can see, this solution is faster even if the number of matches is about 10% of the total size. However, if you try to match all the data:

In [26]: %timeit lookup_set('', '')
1 loops, best of 3: 360 ms per loop

In [27]: %timeit lookup_set2('', '')
1 loops, best of 3: 221 ms per loop

It becomes (not by much) slower, although this is a quite peculiar case, and I doubt you'll frequently match almost all the elements.

Note that the time taken to sort the data is quite small:

In [13]: from random import randint
    ...: from operator import itemgetter
    ...: 
    ...: first = itemgetter(0)
    ...: second = itemgetter(1)
    ...: 
    ...: sa2 = [(str(randint(0, 1000000)), str(randint(0, 1000000))) for _ in range(300000)]

In [14]: %%timeit
    ...: f_sorted = sorted(sa2, key=first)
    ...: s_sorted = sorted(sa2, key=second)
    ...: max_length = max(len(s) for _,s in sa2)
    ...: 
1 loops, best of 3: 881 ms per loop

As you can see, it takes less than one second to build the two sorted copies. Actually the above code would be slightly faster, since it sorts the second copy in place (although Timsort could still require O(n) memory).

This means that if you have to do more than about 6-8 queries, this solution will be faster.


Note: Python's standard library provides a bisect module. However, it doesn't allow a key parameter (even though I remember reading that Guido wanted it, so it may be added in the future). Hence, if you want to use it directly, you'll have to use the "decorate-sort-undecorate" idiom.

Instead of:

f_sorted = sorted(sa, key=first)

You should do:

f_sorted = sorted((f, (f, s)) for f, s in sa)

I.e. you explicitly insert the key as the first element of the tuple. Afterwards you can pass ('123', '') as the element to the bisect_* functions and they should find the correct index.

I decided to avoid this. I copy-pasted the code from the module's source and slightly modified it to provide a simpler interface for your use-case.


Final remark: if you could convert the tuple elements to integers, the comparisons would be faster. However, most of the time would still be taken by the intersection of the sets, so I don't know exactly how much it would improve performance.

Integer manipulation is much faster than string manipulation (and integers are smaller in memory as well).

So if you can compare integers instead, you'll be much faster. I suspect something like this should work for you:

sa = set(int(o.node) for o in vrts_l2_5) 

Then this may work for you: 然后这可能为您工作:

def lookup_set(samples, x_appr, x_len, y_appr, y_len):
    """
    x_appr == SSS0000  where S is the digit to search for
    x_len == number of digits after S (if SSS0000 then x_len == 4)
    """
    # round() rounds to the *nearest* multiple, so truncate the trailing
    # digits with floor division instead for strict prefix matching.
    return ((x, y) for x, y in samples
            if x // 10 ** x_len * 10 ** x_len == x_appr
            and y // 10 ** y_len * 10 ** y_len == y_appr)

Also, it returns a generator, so you're not loading all the results into memory at once.

Updated to use the rounding approach mentioned by Bakuriu.
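Note that round() rounds to the nearest multiple of 10**n rather than truncating (round(1236, -1) == 1240), so for strict prefix matching floor division is the safer truncation. A minimal sketch with hypothetical values:

```python
samples = [(665274, 4652941), (665266, 4652956), (123456, 7654321)]

def has_prefix(value, prefix, trailing):
    """True if `value` equals `prefix` followed by `trailing` more digits."""
    return value // 10 ** trailing == prefix

# Keep pairs whose first element starts with 6652 and second with 46529.
hits = [(x, y) for x, y in samples
        if has_prefix(x, 6652, 2) and has_prefix(y, 46529, 2)]
print(hits)  # -> [(665274, 4652941), (665266, 4652956)]
```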

You could use a trie data structure. It is possible to build one with a tree of dict objects (see "How to create a TRIE in Python"), but there is a package, marisa-trie, that implements a memory-efficient version by binding to C++ libraries.

I have not used this library before, but playing around with it, I got this working:

from random import randint
from marisa_trie import RecordTrie

sa = [(str(randint(1000000,9999999)),str(randint(1000000,9999999))) for i in range(100000)]
# make length of string in packed format big enough!
fmt = ">10p10p"
sa_tries = (RecordTrie(fmt, zip((unicode(first) for first, _ in sa), sa)),
            RecordTrie(fmt, zip((unicode(second) for _, second in sa), sa)))

def lookup_set(sa_tries, x_appr, y_appr):
    """lookup prefix in the appropriate trie and intersect the result"""
    return (set(item[1] for item in sa_tries[0].items(unicode(x_appr))) &
            set(item[1] for item in sa_tries[1].items(unicode(y_appr))))

lookup_set(sa_tries, "2", "4")

I went through and implemented the 4 suggested solutions to compare their efficiency. I ran the tests with different prefix lengths to see how the input would affect performance. The trie and sorted-list performance is definitely sensitive to the length of the input, with both getting faster as the input gets longer (I think it is actually sensitivity to the size of the output, since the output gets smaller as the prefix gets longer). However, the sorted-set solution is definitely faster in all situations.

In these timing tests, there were 200000 tuples in sa and 10 runs for each method:

for prefix length 1
  lookup_set_startswith    : min=0.072107 avg=0.073878 max=0.077299
  lookup_set_int           : min=0.030447 avg=0.037739 max=0.045255
  lookup_set_trie          : min=0.111548 avg=0.124679 max=0.147859
  lookup_set_sorted        : min=0.012086 avg=0.013643 max=0.016096
for prefix length 2
  lookup_set_startswith    : min=0.066498 avg=0.069850 max=0.081271
  lookup_set_int           : min=0.027356 avg=0.034562 max=0.039137
  lookup_set_trie          : min=0.006949 avg=0.010091 max=0.032491
  lookup_set_sorted        : min=0.000915 avg=0.000944 max=0.001004
for prefix length 3
  lookup_set_startswith    : min=0.065708 avg=0.068467 max=0.079485
  lookup_set_int           : min=0.023907 avg=0.033344 max=0.043196
  lookup_set_trie          : min=0.000774 avg=0.000854 max=0.000929
  lookup_set_sorted        : min=0.000149 avg=0.000155 max=0.000163
for prefix length 4
  lookup_set_startswith    : min=0.065742 avg=0.068987 max=0.077351
  lookup_set_int           : min=0.026766 avg=0.034558 max=0.052269
  lookup_set_trie          : min=0.000147 avg=0.000167 max=0.000189
  lookup_set_sorted        : min=0.000065 avg=0.000068 max=0.000070

Here's the code:

import random
def random_digits(num_digits):
    return random.randint(10**(num_digits-1), (10**num_digits)-1)

sa = [(str(random_digits(6)),str(random_digits(7))) for _ in range(200000)]

### naive approach
def lookup_set_startswith(x_appr, y_appr):
    return [item for item in sa if item[0].startswith(x_appr) and item[1].startswith(y_appr) ]

### trie approach
from marisa_trie import RecordTrie

# make length of string in packed format big enough!
fmt = ">10p10p"
sa_tries = (RecordTrie(fmt, zip([unicode(first) for first, second in sa], sa)),
            RecordTrie(fmt, zip([unicode(second) for first, second in sa], sa)))

def lookup_set_trie(x_appr, y_appr):
    # lookup prefix in the appropriate trie and intersect the result
    return set(item[1] for item in sa_tries[0].items(unicode(x_appr))) & \
           set(item[1] for item in sa_tries[1].items(unicode(y_appr)))

### int approach
sa_ints = [(int(first), int(second)) for first, second in sa]

sa_lens = tuple(map(len, sa[0]))

def lookup_set_int(x_appr, y_appr):
    x_limit = 10**(sa_lens[0]-len(x_appr))
    y_limit = 10**(sa_lens[1]-len(y_appr))

    x_int = int(x_appr) * x_limit
    y_int = int(y_appr) * y_limit

    return [sa[i] for i, int_item in enumerate(sa_ints) \
        if (x_int <= int_item[0] and int_item[0] < x_int+x_limit) and \
           (y_int <= int_item[1] and int_item[1] < y_int+y_limit) ]

### sorted set approach
from operator import itemgetter

first = itemgetter(0)
second = itemgetter(1)

sa_sorted = (sorted(sa, key=first), sorted(sa, key=second))
max_length = max(len(s) for pair in sa for s in pair)  # longest string in either position

# See: bisect module from stdlib
def bisect_right(seq, element, key):
    lo = 0
    hi = len(seq)
    element = element.ljust(max_length, '9')
    while lo < hi:
        mid = (lo+hi)//2
        if element < key(seq[mid]):
            hi = mid
        else:
            lo = mid + 1
    return lo


def bisect_left(seq, element, key):
    lo = 0
    hi = len(seq)
    while lo < hi:
        mid = (lo+hi)//2
        if key(seq[mid]) < element:
            lo = mid + 1
        else:
            hi = mid
    return lo


def lookup_set_sorted(x_appr, y_appr):
    x_left = bisect_left(sa_sorted[0], x_appr, key=first)
    x_right = bisect_right(sa_sorted[0], x_appr, key=first)
    x_candidates = sa_sorted[0][x_left:x_right]
    y_left = bisect_left(sa_sorted[1], y_appr, key=second)
    y_right = bisect_right(sa_sorted[1], y_appr, key=second)
    y_candidates = sa_sorted[1][y_left:y_right]
    return set(x_candidates).intersection(y_candidates)     


####
# test correctness
ntests = 10

candidates = [lambda x, y: set(lookup_set_startswith(x,y)), 
              lambda x, y: set(lookup_set_int(x,y)),
              lookup_set_trie, 
              lookup_set_sorted]
print "checking correctness (or at least consistency)..."
for dlen in range(1,5):
    print "prefix length %d:" % dlen,
    for i in range(ntests):
        print " #%d" % i,
        prefix = map(str, (random_digits(dlen), random_digits(dlen)))
        answers = [c(*prefix) for c in candidates]
        for i, ans in enumerate(answers):
            for j, ans2 in enumerate(answers[i+1:]):
                assert ans == ans2, "answers for %s for #%d and #%d don't match" \
                                    % (prefix, i, j+i+1)
    print


####
# time calls
import timeit
import numpy as np

ntests = 10

candidates = [lookup_set_startswith,
              lookup_set_int,
              lookup_set_trie, 
              lookup_set_sorted]

print "timing..."
for dlen in range(1,5):
    print "for prefix length", dlen

    times = [ [] for c in candidates ]
    for _ in range(ntests):
        prefix = map(str, (random_digits(dlen), random_digits(dlen)))

        for c, c_times in zip(candidates, times):
            tstart = timeit.default_timer()
            trash = c(*prefix)
            c_times.append(timeit.default_timer()-tstart)
    for c, c_times in zip(candidates, times):
        print "  %-25s: min=%f avg=%f max=%f" % (c.func_name, min(c_times), np.mean(c_times), max(c_times))

There may be, but not by terribly much. str.startswith and `and` are both short-circuiting operations (they can return as soon as they find a failure), and indexing tuples is a fast operation. Most of the time spent here will be from object lookups, such as finding the startswith method for each string. Probably the most worthwhile option is to run it through PyPy.

A faster solution would be to create a dictionary, with the first value as the key and the second as the value.

  1. Then you would search for keys matching x_appr in the ordered key list of the dict (the ordered list lets you optimize the search of the key list, with a binary search for example). This produces a key list, named for example k_list.

  2. Then look up the values of the dict that have a key in k_list and match y_appr.

You can also apply the second step (values that match y_appr) before appending to k_list, so that k_list contains all the keys of the correct elements of the dict.
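The two steps above can be sketched as follows (a hypothetical implementation following the description: duplicates in the first position are kept in lists, and '\uffff' serves as an upper sentinel for the prefix range):

```python
import bisect
from collections import defaultdict

pairs = [("665274", "4652941"), ("665266", "4652956"), ("123456", "7654321")]

# First value as key, second as value (lists handle duplicate keys),
# plus an ordered key list for binary search.
index = defaultdict(list)
for x, y in pairs:
    index[x].append(y)
keys = sorted(index)

def lookup(x_appr, y_appr):
    # Step 1: binary-search the ordered key list for keys matching x_appr.
    lo = bisect.bisect_left(keys, x_appr)
    hi = bisect.bisect_right(keys, x_appr + "\uffff")
    k_list = keys[lo:hi]
    # Step 2: keep only the values matching y_appr.
    return [(k, v) for k in k_list
            for v in index[k] if v.startswith(y_appr)]

print(lookup("6652", "46529"))  # -> [('665266', '4652956'), ('665274', '4652941')]
```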

Here I've just compared the 'in' method and the 'find' method:

The CSV input file contains a list of URLs.

# -*- coding: utf-8 -*-

### test perfo str in set

import re
import sys
import time
import json
import csv
import timeit

cache = set()

#######################################################################

def checkinCache(c):
  global cache
  for s in cache:
    if c in s:
      return True
  return False

#######################################################################

def checkfindCache(c):
  global cache
  for s in cache:
    if s.find(c) != -1:
      return True
  return False

#######################################################################

print "1/3-loading pages..."
with open("liste_all_meta.csv.clean", "rb") as f:
    reader = csv.reader(f, delimiter=",")
    for i,line in enumerate(reader):
      cache.add(re.sub("'","",line[2].strip()))

print "  "+str(len(cache))+" PAGES IN CACHE"

print "2/3-test IN..."
tstart = timeit.default_timer()
for i in range(0, 1000):
  checkinCache("string to find"+str(i))
print timeit.default_timer()-tstart

print "3/3-test FIND..."
tstart = timeit.default_timer()
for i in range(0, 1000):
  checkfindCache("string to find"+str(i))
print timeit.default_timer()-tstart

print "\n\nBYE\n"

Results in seconds:

1/3-loading pages...
  482897 PAGES IN CACHE
2/3-test IN...
107.765980005
3/3-test FIND...
167.788629055


BYE

So, the 'in' method is faster than the 'find' method :)

Have fun
