
Implementing a More Efficient Python Algorithm for Substrings in Lists of Strings

I'm working on the following problem in Python.

I have a large list (size n) of input strings, and I need to check whether each string contains, as a substring, any word from a large dictionary (size m).
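As a baseline, here is a minimal sketch of the naive approach (the names `inputs` and `words` are hypothetical stand-ins for my actual data):

```python
# Hypothetical sample data: `inputs` is the list of n strings,
# `words` is the dictionary of m words to look for.
inputs = ["the quick brown fox", "lorem ipsum"]
words = {"quick", "ipsum", "missing"}

# Naive check: for each input string, test each dictionary word
# with the `in` operator (one substring search per word).
hits = {s: [w for w in words if w in s] for s in inputs}
```

This is O(n * m) substring searches, which is what I'm trying to beat.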

I've looked around for efficient algorithms for this problem and found these: https://github.com/laurentluce/python-algorithms/blob/master/algorithms/string_matching.py

the Rabin-Karp and KMP matching algorithms. (Note that I've replaced the ord() calls in the Rabin-Karp implementation with a precomputed dictionary for efficiency.)
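For reference, here is a minimal Rabin-Karp sketch of the kind I'm using (an assumption on my part: this mirrors the linked code's rolling-hash approach, with a precomputed dict standing in for repeated ord() calls; `base` and `mod` values are illustrative):

```python
def rabin_karp(pattern, text, base=256, mod=101):
    """Return the index of the first occurrence of pattern in text, or -1."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    if m > n:
        return -1
    # Precompute character codes once instead of calling ord() repeatedly.
    codes = {c: ord(c) for c in set(pattern) | set(text)}
    h = pow(base, m - 1, mod)  # weight of the leading character in the window
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (base * p_hash + codes[pattern[i]]) % mod
        t_hash = (base * t_hash + codes[text[i]]) % mod
    for s in range(n - m + 1):
        # On a hash match, verify the window to rule out collisions.
        if p_hash == t_hash and text[s:s + m] == pattern:
            return s
        if s < n - m:
            # Roll the hash: drop text[s], append text[s + m].
            t_hash = (base * (t_hash - codes[text[s]] * h)
                      + codes[text[s + m]]) % mod
    return -1
```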

However, these actually perform slower than using an 'in' operation in Python, which uses the Boyer–Moore–Horspool algorithm. I suppose that this is because the contains() method invoked by 'in' is implemented in C.

How can I override this method for Python's string class with a Rabin-Karp implementation written in C?

You could have a look at cython:

http://docs.cython.org/src/quickstart/cythonize.html

I find it easier to write some custom code for a specific operation than to override a core Python structure.

You can't, I am sorry to say.

The built-in types, being constructed at the time the interpreter is compiled, cannot be patched at run-time. If speed is really so important then you might want to write a C extension type that subclasses the built-in string type but with a different __contains__ method.
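To illustrate the interface such a subclass would expose, here is a pure-Python sketch (assumption: a real speed-up would require writing this as a C extension type, as suggested above; the class name `MatchStr` is hypothetical, and plain str.find stands in for a custom matcher):

```python
class MatchStr(str):
    """A str subclass whose `in` operator routes through a custom matcher."""

    def __contains__(self, pattern):
        # Swap in any search routine here; str.find stands in for a
        # hypothetical Rabin-Karp implementation written in C.
        return self.find(pattern) != -1

s = MatchStr("hello world")
```

Note that `pattern in s` only hits the override when `s` is a MatchStr, and a pure-Python override will be slower than the built-in, not faster; the point is the shape a C subclass would take.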
