
Implementing a More Efficient Python Algorithm for Substrings in Lists of Strings

I'm working on the following in Python.

I have a problem that involves checking a large list (size n) of input strings to see whether any of them contains, as a substring, a word from a large dictionary (size m).

I've looked around for efficient algorithms for this problem and found these: https://github.com/laurentluce/python-algorithms/blob/master/algorithms/string_matching.py

These are the Rabin-Karp and KMP matching algorithms. (Note that I've replaced the ord() function in the Rabin-Karp implementation with a dictionary lookup for efficiency.)

However, these actually run slower than Python's 'in' operator, which uses the Boyer–Moore–Horspool algorithm. I suppose this is because the __contains__() method invoked by 'in' is implemented in C.

How can I override this method on Python's string class with a Rabin-Karp implementation written in C?

You could have a look at Cython:

http://docs.cython.org/src/quickstart/cythonize.html

I find it easier to write some custom code for a specific operation than to override a core Python structure.
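As a rough illustration of what "custom code for a specific operation" could look like, here is a minimal sketch of a standalone helper that leaves the string class alone. The function name strings_containing_any is hypothetical, and the body simply reuses the built-in 'in' operator that the question found to be fastest; it is a baseline that could later be moved into a Cython module, not a tuned solution.

```python
def strings_containing_any(strings, words):
    """Return the input strings that contain at least one dictionary word.

    Hypothetical helper: `strings` is the list of n inputs and `words`
    is the dictionary of m words from the question. Each check uses the
    built-in substring search behind the `in` operator.
    """
    # Drop empty words, which would trivially match every string.
    words = [w for w in words if w]
    return [s for s in strings if any(w in s for w in words)]


if __name__ == "__main__":
    inputs = ["hello world", "foo bar", "xyz"]
    dictionary = ["wor", "bar"]
    print(strings_containing_any(inputs, dictionary))
```

This performs up to n * m substring searches, so it only addresses where the code lives, not the overall complexity.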

You can't, I am sorry to say. 你不能,我很抱歉地说。

The built-in types are constructed when the interpreter is compiled, so they cannot be patched at run-time. If speed really is that important, you might want to write a C extension type that subclasses the built-in string type but has a different __contains__ method.
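To make the distinction concrete, here is a small pure-Python sketch, assuming CPython. It shows that assigning to the built-in type's method fails, while a subclass can supply its own containment check. The names rabin_karp_find, RKString and try_patch_builtin are made up for this example, and the Rabin-Karp routine is a textbook version, not the one from the linked repository. Being pure Python it will be slower than the built-in search; a C extension type would follow the same shape.

```python
def rabin_karp_find(text, pattern, base=256, mod=1_000_003):
    """Return the index of the first match of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    if m > n:
        return -1
    high = pow(base, m - 1, mod)  # weight of the window's leading character
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod
    for i in range(n - m + 1):
        # Compare the actual characters only when the hashes agree.
        if p_hash == t_hash and text[i:i + m] == pattern:
            return i
        if i < n - m:
            # Roll the hash: drop text[i], append text[i + m].
            t_hash = ((t_hash - ord(text[i]) * high) * base
                      + ord(text[i + m])) % mod
    return -1


class RKString(str):
    """str subclass whose `in` operator uses the routine above."""

    def __contains__(self, pattern):
        return rabin_karp_find(self, pattern) != -1


def try_patch_builtin():
    """Attempt to replace str.__contains__; report whether it worked."""
    try:
        str.__contains__ = lambda self, item: True
    except TypeError:
        return False  # built-in types reject attribute assignment
    return True


if __name__ == "__main__":
    print(try_patch_builtin())
    print("wor" in RKString("hello world"))
```

Only strings explicitly wrapped in RKString use the custom search; ordinary str objects, including slices of an RKString, keep the built-in behaviour.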
