Cythonize: check whether any word in a list of strings is a substring of another string

Question

I want to iterate over an input list of words, list_words, and check whether any of them occurs in an input string.

I tried to cythonize the code, but when I annotate it, almost all of the code is highlighted in yellow, which indicates Python interaction.

I am not sure how I can speed this up:

cpdef cy_check_any_word_is_substring(list_words, string):
    cdef unicode w
    cdef unicode s_lowered =  string.lower()
    for w in list_words:
        if w in s_lowered:
            return True
    return False

Example

# all words in list_words are lower cased
list_words = ['cat', 'dog', 'eat', 'seat']
input_string = 'The animal saw the Dog and started to make noises'

# should return true
cy_check_any_word_is_substring(list_words, input_string)

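Note that I want the code to work regardless of whether characters are uppercase or lowercase (that is why I call string.lower()). I assume the words in the input list are already lowercased.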

Update

I wonder whether a solution using C++ would be faster. I don't know C++, but I tried:

from libcpp.vector cimport vector
from libcpp.string cimport string

cpdef cy_check_any_word_is_substring(vector[string] list_words,string string):
    s_lowered =  string.lower()
    for w in list_words:
        if w in s_lowered:
            return True
    return False

But it produces the error

Invalid types for 'in' (string, Python object)

Answer 1

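The basic problem with the C++ solution is that if you pass it a Python iterable, there is a hidden type conversion: it has to iterate over the entire list and convert each string to a C++ string. For that reason, I doubt it will gain you much.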

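If you can generate the data as a C++ vector without the type conversion, it will probably work better. For that you should use a cdef function rather than a cpdef function (I rarely like cpdef functions, since they are usually the worst of both worlds).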

The specific problems you are running into:

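The C++ string class has no .lower() function, so the line s_lowered = string.lower() implicitly converts it back to a Python bytes object and then calls .lower() on that. You have to implement .lower() yourself (or convert to a C++ string after calling .lower() on the Python object).
w in s_lowered is not implemented for C++ strings. You want s_lowered.find(w) != npos (where npos is cimported from libcpp.string).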
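
The two fixes above can be sketched as follows. This is only a minimal sketch, not a tested drop-in replacement: the function names any_word_in_string and check_any_word_is_substring are hypothetical, and encoding to UTF-8 is an assumption (byte-wise search on UTF-8 data finds the same matches as searching the original text). Lowercasing is done on the Python str before conversion, and the search uses find(...) != npos inside a cdef function.

```cython
# distutils: language = c++
from libcpp.string cimport string, npos
from libcpp.vector cimport vector

# Hypothetical helper: pure C++ loop, no Python objects involved.
cdef bint any_word_in_string(vector[string]& words, string& s_lowered):
    cdef size_t i
    for i in range(words.size()):
        # find() returns npos when the word does not occur in s_lowered
        if s_lowered.find(words[i]) != npos:
            return True
    return False

# Hypothetical Python-facing wrapper: all conversions are explicit here.
def check_any_word_is_substring(list_words, text):
    # Lowercase while it is still a Python str, then convert to a C++ string
    # (assumption: UTF-8 encoding).
    cdef string s_lowered = text.lower().encode("utf-8")
    # Each word is still copied into a C++ string here.
    cdef vector[string] words = [w.encode("utf-8") for w in list_words]
    return any_word_in_string(words, s_lowered)
```

The wrapper still pays for converting every word on each call, which is the hidden cost described above. A benefit is only plausible if the vector is built once and reused across many input strings.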

Answer 1 (2023-01-20 18:31:17)
