Why is str.find so fast?

Question

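I had an earlier question in which I looked for substrings by iterating over a string and using slices. It turned out that this is a really bad idea for performance. str.find is much faster. But I don't understand why.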

import random
import string
import timeit

# Generate 1 MB of random string data
haystack = "".join(random.choices(string.ascii_lowercase, k=1_000_000))

def f():
    return [i for i in range(len(haystack)) if haystack[i : i + len(needle)] == needle]

def g():
    return [i for i in range(len(haystack)) if haystack.startswith(needle, i)]

def h():
    def find(start=0):
        while True:
            position = haystack.find(needle, start)
            if position < 0:
                return
            start = position + 1
            yield position
    return list(find())

number = 100
needle = "abcd"
expectation = f()
for func in "fgh":
    assert eval(func + "()") == expectation
    t = timeit.timeit(func + "()", globals=globals(), number=number)
    print(func, t)

Results:

f 26.46937609199813
g 16.11952730899793
h 0.07721933699940564

Answer 1

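f and g are slow because they check whether needle can be found at every possible position of haystack, which results in O(nm) complexity, where n is the length of haystack and m is the length of needle. f is the slower of the two because the slicing operation creates a new string object at every position (as Barmar pointed out in the comments).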
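To make the O(nm) bound concrete, here is a minimal sketch of the position-by-position strategy that f and g follow, written character by character so the work can be counted. The function name `naive_search` and the comparison counter are illustrative assumptions, not part of the question's code. The worst case appears when almost the whole needle matches at every position; on random lowercase text, most positions fail on the first character, so the cost in f and g is dominated by the per-position interpreter overhead instead.

```python
def naive_search(haystack, needle):
    """Try every start index; return (match positions, character comparisons)."""
    n, m = len(haystack), len(needle)
    comparisons = 0
    matches = []
    for i in range(n - m + 1):
        j = 0
        while j < m:
            comparisons += 1
            if haystack[i + j] != needle[j]:
                break  # mismatch: give up on this start index
            j += 1
        if j == m:
            matches.append(i)
    return matches, comparisons


# Worst case: the needle almost matches at every position.
worst_haystack = "a" * 1000
worst_needle = "a" * 9 + "b"
worst_matches, worst_comparisons = naive_search(worst_haystack, worst_needle)
# Every one of the n - m + 1 start positions costs m comparisons.
print(worst_matches, worst_comparisons)
```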

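h is fast because it can skip many positions. For example, if needle does not occur at all, find is executed only once. The built-in find function is highly optimized C code, so it is faster than interpreted pure Python code. In addition, find uses an efficient algorithm known as Crochemore and Perrin's Two-Way algorithm. When the strings are relatively large, this algorithm is much faster than searching for needle at every possible position of haystack. The relevant code can be found in the CPython source.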
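The difference in the amount of Python-level work can be counted directly. The sketch below is illustrative only: the helper names `naive_positions` and `find_positions` and the smaller 100,000-character haystack are assumptions made to keep it quick to run. It counts how many times the interpreter performs a check in the style of g, compared with how many times it calls str.find in the style of h.

```python
import random
import string

random.seed(0)  # fixed seed so repeated runs see the same data
haystack = "".join(random.choices(string.ascii_lowercase, k=100_000))
needle = "abcd"


def naive_positions(haystack, needle):
    """Check every index, as g does; return (matches, number of checks)."""
    checks = 0
    matches = []
    for i in range(len(haystack)):
        checks += 1
        if haystack.startswith(needle, i):
            matches.append(i)
    return matches, checks


def find_positions(haystack, needle):
    """Jump from match to match, as h does; return (matches, number of find calls)."""
    calls = 0
    matches = []
    start = 0
    while True:
        calls += 1
        position = haystack.find(needle, start)
        if position < 0:
            return matches, calls
        matches.append(position)
        start = position + 1


naive_matches, checks = naive_positions(haystack, needle)
find_matches, calls = find_positions(haystack, needle)
print(f"checks in the naive loop: {checks}, calls to str.find: {calls}")
```

The naive loop performs one Python-level check per character of haystack, while the find-based loop makes one call per match plus a final call that returns -1. All of the scanning between matches happens inside the C implementation.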

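If the number of occurrences is relatively small, your implementation should already be good. Otherwise, it may be better to use a custom variant based on the CPTW (Crochemore-Perrin Two-Way) algorithm, or possibly the KMP algorithm, but doing this in pure Python would be very inefficient. You could do it in C or with Cython. That said, it is not trivial to do and not easy to maintain.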
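For reference, this is roughly what a KMP-based search for all (possibly overlapping) occurrences looks like. It is a textbook sketch, not the variant the answer has in mind, and the names `build_failure_table` and `kmp_find_all` are made up for illustration. As the answer warns, running this in pure Python is expected to be far slower than the built-in find; it only shows the structure that one would port to C or Cython.

```python
def build_failure_table(needle):
    """failure[i] is the length of the longest proper prefix of
    needle[:i + 1] that is also a suffix of it."""
    failure = [0] * len(needle)
    k = 0
    for i in range(1, len(needle)):
        while k > 0 and needle[i] != needle[k]:
            k = failure[k - 1]
        if needle[i] == needle[k]:
            k += 1
        failure[i] = k
    return failure


def kmp_find_all(haystack, needle):
    """Return the start index of every (possibly overlapping) occurrence."""
    if not needle:
        # Mirror str.find, which matches the empty string at every index.
        return list(range(len(haystack) + 1))
    failure = build_failure_table(needle)
    matches = []
    k = 0  # number of needle characters currently matched
    for i, ch in enumerate(haystack):
        while k > 0 and ch != needle[k]:
            k = failure[k - 1]  # fall back without re-reading haystack
        if ch == needle[k]:
            k += 1
        if k == len(needle):
            matches.append(i - k + 1)
            k = failure[k - 1]  # keep going to allow overlapping matches
    return matches


print(kmp_find_all("abcabcabc", "abc"))
print(kmp_find_all("aaaa", "aa"))
```

Each character of haystack is read once and the fallback steps are bounded by the number of successful matches, so the search runs in O(n + m) time in total.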

Answer 2

Built-in Python functions are implemented in C, which makes them much faster. A function written in Python cannot match that performance.


Answer 1
1 2022-05-02 00:46:11

Answer 2
1 2022-05-02 00:54:39
