
Python: Improving sub-string search by embedding sophisticated algorithms

I am extending my previous question, python efficient substring search.

I am interested in improving the performance of the sub-string search implementation.

Some of the answers to my previous question pointed out that substring search is implemented using fastsearch, an algorithm inspired by Boyer–Moore; here is the source code.

Other answers pointed me to Python implementations of the Boyer–Moore and Rabin–Karp algorithms.

Would it be efficient to embed C code implementing substring search with those algorithms (Boyer–Moore, Rabin–Karp)?

You haven't specified what you mean by 'efficient'. What tradeoffs are you willing to make? Would you be prepared to pay a performance price when initializing a new string? When starting the search? Would you trade more memory for more speed?

The Python developers set clear goals when they developed the Python string library:

  • should be faster than the current brute-force algorithm for all test cases (based on real-life code), including Jim Hugunin's worst-case test
  • small setup overhead; no dynamic allocation in the fast path (O(m) for speed, O(1) for storage)
  • sublinear search behaviour in good cases (O(n/m))
  • no worse than the current algorithm in worst case (O(nm))
  • should work well for both 8-bit strings and 16-bit or 32-bit Unicode strings (no O(σ) dependencies)
  • many real-life searches should be good, very few should be worst case
  • reasonably simple implementation
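For contrast with those goals, here is a minimal pure-Python sketch of Boyer–Moore–Horspool (the function name `horspool_find` is my own, not anything from CPython). Note the per-pattern skip table it has to build before searching: that is exactly the kind of setup cost, and alphabet-sized (O(σ)) dependency, that the goals above rule out.

```python
def horspool_find(text, pattern):
    """Return the lowest index of pattern in text, or -1 if absent.

    Illustrative Boyer-Moore-Horspool sketch, NOT CPython's fastsearch.
    """
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    if m > n:
        return -1
    # Per-pattern setup: for each character in the pattern (except the
    # last), record how far the window may shift when that character is
    # under the window's last cell. This is the O(m) preprocessing step,
    # and a full-alphabet version of this table would be O(sigma) storage.
    skip = {c: m - i - 1 for i, c in enumerate(pattern[:-1])}
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            return pos
        # Characters absent from the pattern let us skip a whole window.
        pos += skip.get(text[pos + m - 1], m)
    return -1
```

The payoff is the sublinear good case (skipping `m` characters at a time), but it is bought with the setup work the Python devs wanted to avoid on every search call.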

So the devs set limits on performance for both the search case and the setup case, on storage requirements, and on maintenance efficiency. Those boundaries ruled out Boyer–Moore (as it requires preprocessing of the searched-for string, a startup cost and a storage cost), and although I see no evidence that the devs considered Rabin–Karp, it can be ruled out on the same grounds (you need to compute and store the hashes).
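To make that hash setup cost concrete, here is a hypothetical pure-Python Rabin–Karp sketch (the function name `rabin_karp_find` and the base/modulus choices are mine for illustration). Before a single comparison happens, it must hash the pattern and the first text window:

```python
def rabin_karp_find(text, pattern, base=256, mod=1_000_003):
    """Return the lowest index of pattern in text, or -1 if absent."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    if m > n:
        return -1
    # Up-front setup: hash the pattern and the first window of the text.
    # This is the startup and storage cost that rules the algorithm out
    # under the goals quoted above.
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod
    high = pow(base, m - 1, mod)  # weight of the window's first character
    for pos in range(n - m + 1):
        # A hash match may be a collision, so verify with a real compare.
        if t_hash == p_hash and text[pos:pos + m] == pattern:
            return pos
        if pos < n - m:
            # Roll the hash: drop the leading char, append the next one.
            t_hash = ((t_hash - ord(text[pos]) * high) * base
                      + ord(text[pos + m])) % mod
    return -1
```

The rolling update keeps each step O(1), but the O(m) hashing setup and the extra state are paid on every call, which is precisely the "small setup overhead" goal it would violate.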

The boundaries were set based on a lot of Python internals and usage experience. The above summary wasn't pulled out of thin air; it is merely a summary of that experience.

Now, if you have a specific case where your trade-offs can be set differently, then sure, a C implementation of a different algorithm could well beat the standard Python implementation. But it'll be more efficient according to a different set of criteria.

In any case, the Python search algorithm deals with the small-strings case. If you try to apply it to a large body of text, it will not be able to perform as well as an algorithm that makes different choices that work well for large texts. And if you had to search for text through 10,000,000 documents you'd want to use some kind of indexing solution instead of a puny little Python string search.
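As a toy sketch of what "some kind of indexing solution" means (the `build_index` helper is my own illustration, not a real library): pay an indexing cost once up front, then answer lookups across many documents without rescanning the text each time. Real systems add tokenization, ranking, and persistence on top of this idea.

```python
from collections import defaultdict

def build_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "quick thinking saves the dog",
}
index = build_index(docs)
print(sorted(index["quick"]))  # documents containing "quick" -> [1, 3]
```

Each query is then a dictionary lookup rather than a scan over 10,000,000 documents, which is the trade-off the answer alludes to: more memory and setup cost in exchange for much cheaper searches.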

Compare that to sorting a list of 100 items with the default sort implementation, vs. sorting 10,000,000,000 integers. In the latter case there are sorting implementations that can easily beat the default Python offering.

It should also be noted that Python has a history of algorithm innovation; the standard sort algorithm in Python is TimSort, a new algorithm invented by Tim Peters to fit the pragmatic real-life circumstances the Python interpreter has to deal with. That algorithm has since been made the default in Java and on the Android platform as well. Thus, I tend to trust the Python core devs' decisions.

As far as I know, no one has embedded a different implementation, as replacing the default is not going to work without patching the Python C code. You can easily create a specialized string type that implements a different search algorithm, of course. There may well be libraries out there that use C for specialized search algorithms based on Boyer–Moore, Rabin–Karp or any other algorithm, as that might well be the better choice for their specific problem domain.
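Before reaching for any replacement, it is worth measuring: for typical short strings the built-in C-level `str.find` is very hard to beat from Python. A quick `timeit` comparison against a naive pure-Python scan (the `naive_find` function here is a hypothetical stand-in for any pure-Python algorithm port) makes the gap visible:

```python
import timeit

def naive_find(text, pattern):
    """Brute-force scan, standing in for any pure-Python search port."""
    m = len(pattern)
    for pos in range(len(text) - m + 1):
        if text[pos:pos + m] == pattern:
            return pos
    return -1

text = "abcde" * 1000 + "needle"
builtin = timeit.timeit(lambda: text.find("needle"), number=1000)
pure_py = timeit.timeit(lambda: naive_find(text, "needle"), number=1000)
print(f"str.find: {builtin:.4f}s  pure Python: {pure_py:.4f}s")
```

Exact timings will vary by machine, but the point stands: a C implementation of a fancier algorithm has to beat not just the brute-force idea but CPython's already-optimized fastsearch, on your workload, before embedding it pays off.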
