
How to make this search and count much faster?

def count_occurrences(string):
    count = 0
    for text in GENERIC_TEXT_STORE:
        count += text.count(string)
    return count

GENERIC_TEXT_STORE is a list of strings. For example:

GENERIC_TEXT_STORE = ['this is good', 'this is a test', "that's not a test"]

Given a string, e.g. 'this', I want to find how many times it occurs in GENERIC_TEXT_STORE. If my GENERIC_TEXT_STORE is huge, this is very slow. What are the ways to make this search and count much faster? For instance, if I split the big GENERIC_TEXT_STORE list into multiple smaller lists, would that be faster?
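
For instance, with the example list above, the current function gives:

count_occurrences('this')   # -> 2 ('this' appears once in each of the first two strings)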

If the multiprocessing module is useful here, how can I use it for this purpose?

First, check that your algorithm is actually doing what you want, as suggested in the comments above. The count() method checks for substring matches, so you could probably get a big improvement by refactoring your code to test only complete words, assuming that's what you want. Something like this could work as your condition.

any(word == string for word in text.split())
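
For instance, a rough sketch of a whole-word version built around that condition (the function name is just illustrative, and note that it counts texts containing the word at least once, not every occurrence):

def count_word_occurrences(string):
    count = 0
    for text in GENERIC_TEXT_STORE:
        # Only count texts that contain `string` as a complete word,
        # so searching for 'is' no longer matches the 'is' inside 'this'.
        if any(word == string for word in text.split()):
            count += 1
    return count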

Multiprocessing would probably help, as you could split the list into smaller lists (one per core) and then add up all the results when each process finishes (avoiding inter-process communication during the execution). I've found from testing that multiprocessing in Python varies quite a bit between operating systems: Windows and Mac can take quite a long time to actually spawn the processes, whereas Linux seems to do it much faster. Some people have said that setting a CPU affinity for each process using pstools is important, but I didn't find this made much difference in my case.
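
A rough sketch of that idea with multiprocessing.Pool (the chunking strategy and helper names here are illustrative, not a tuned drop-in replacement):

import multiprocessing as mp

def count_in_chunk(args):
    # Count substring occurrences of `string` within one chunk of texts.
    chunk, string = args
    return sum(text.count(string) for text in chunk)

def count_occurrences_parallel(string, texts, workers=None):
    workers = workers or mp.cpu_count()
    # One chunk per worker; partial results are only combined at the end,
    # so there is no inter-process communication during the counting.
    size = max(1, (len(texts) + workers - 1) // workers)
    chunks = [texts[i:i + size] for i in range(0, len(texts), size)]
    with mp.Pool(workers) as pool:
        return sum(pool.map(count_in_chunk, [(chunk, string) for chunk in chunks]))

if __name__ == '__main__':
    print(count_occurrences_parallel('this', GENERIC_TEXT_STORE))

For small lists the process start-up and pickling overhead will usually outweigh any gain, so this only pays off once GENERIC_TEXT_STORE is genuinely large.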

Another option would be to look at using Cython to compile your Python into a C extension, or alternatively to rewrite the whole thing in a faster language, but as you've tagged this question Python I assume you're not so keen on that.

You can use re.

In [2]: GENERIC_TEXT_STORE = ['this is good', 'this is a test', 'that\'s not a test']

In [3]: def count_occurrences(string):
   ...:     count = 0
   ...:     for text in GENERIC_TEXT_STORE:
   ...:         count += text.count(string)
   ...:     return count

In [6]: import re

In [7]: def count(_str):
   ...:     return len(re.findall(_str,''.join(GENERIC_TEXT_STORE)))
   ...:
In [28]: def count1(_str):
    ...:     return ' '.join(GENERIC_TEXT_STORE).count(_str)
    ...:

Now using timeit to analyse the execution time.

When the size of GENERIC_TEXT_STORE is 3:

In [9]: timeit count('this')
1.27 µs ± 57.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [10]: timeit count_occurrences('this')
697 ns ± 25.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [33]: timeit count1('this')
385 ns ± 22.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

When the size of GENERIC_TEXT_STORE is 15000:

In [17]: timeit count('this')
1.07 ms ± 118 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [18]: timeit count_occurrences('this')
3.35 ms ± 279 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [37]: timeit count1('this')
275 µs ± 18.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

When the size of GENERIC_TEXT_STORE is 150000:

In [20]: timeit count('this')
5.7 ms ± 2.39 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [21]: timeit count_occurrences('this')
33 ms ± 3.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [40]: timeit count1('this')
3.98 ms ± 211 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

When the size of GENERIC_TEXT_STORE is over a million (1500000):

In [23]: timeit count('this')
50.3 ms ± 7.21 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [24]: timeit count_occurrences('this')
283 ms ± 12.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [43]: timeit count1('this')
40.7 ms ± 1.09 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In terms of execution time: count1 < count < count_occurrences.

When the size of GENERIC_TEXT_STORE is large, count and count1 are almost 4 to 5 times faster than count_occurrences.
