Reducing the time complexity of nested loops in Python

Question

This is my code. It takes 17 hours to finish. Can you suggest alternative code that reduces the computation time?

# test algorithm1 - fuzzy
from fuzzywuzzy import fuzz

matched_pair = []
for x in dataset1['full_name_eng']:
    for y in dataset2['name']:
        # keep every pair whose similarity score is above 85
        if fuzz.token_sort_ratio(x, y) > 85:
            matched_pair.append((x, y))
            print((x, y))

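I tried several different approaches, but none of them worked.

dataset1 has 10k rows and dataset2 has 1M rows. fuzz.token_sort_ratio(x, y) is a function that takes two arguments (two strings) and returns an integer, the similarity of the two strings.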
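To make the scorer concrete, here is a rough standard-library sketch of what a token-sort comparison does: lowercase both strings, sort their whitespace-separated tokens, and compare the rejoined strings. The helper name `token_sort_similarity` is hypothetical, and `difflib.SequenceMatcher` is only a stand-in, so its scores will not match FuzzyWuzzy's exactly.

```python
from difflib import SequenceMatcher


def token_sort_similarity(a: str, b: str) -> int:
    """Approximate a token-sort score on a 0-100 scale (hypothetical helper)."""
    # Lowercase, split on whitespace, and sort tokens so word order is ignored.
    a_sorted = " ".join(sorted(a.lower().split()))
    b_sorted = " ".join(sorted(b.lower().split()))
    # SequenceMatcher.ratio() returns a float in [0, 1]; scale it to 0-100.
    return round(100 * SequenceMatcher(None, a_sorted, b_sorted).ratio())


# Same words in a different order versus two unrelated names.
same_words = token_sort_similarity("John Smith", "smith john")
different = token_sort_similarity("John Smith", "Alice Wong")
```

Because the tokens are sorted before comparison, the first pair is treated as identical, while the second pair scores well below the 85 threshold used in the question.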

Answer 1

Since the dataframes are not really used here, I will simply work with the following two lists:

import string
import random

random.seed(18)
dataset1 = [''.join(random.choice(string.ascii_lowercase + ' ') for _ in range(random.randint(13, 20))) for s in range(1000)]
dataset2 = [''.join(random.choice(string.ascii_lowercase + ' ') for _ in range(random.randint(13, 20))) for s in range(1000)]

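Use these two lists with the code you provided, which uses FuzzyWuzzy. As a first change, you could use RapidFuzz (I am the author), which does basically the same as FuzzyWuzzy but is considerably faster. With my test lists it was about 7 times as fast as your code. Another issue is that fuzz.token_sort_ratio always lowercases the strings and, for example, removes punctuation. While this makes sense for string matching, you are doing it many times for each string in the list, which adds up when working with larger lists. Using RapidFuzz and preprocessing only once is about 14 times as fast on these lists.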

from rapidfuzz import fuzz, utils

# Preprocess every string once instead of on every comparison.
dataset2_processed = [utils.default_process(x) for x in dataset2]
dataset1_processed = [utils.default_process(x) for x in dataset1]

matched_pair = []
for word1, word1_processed in zip(dataset1, dataset1_processed):
    for word2, word2_processed in zip(dataset2, dataset2_processed):
        # processor=None skips the built-in preprocessing; with score_cutoff=85,
        # a score below 85 is returned as 0, which is falsy.
        if fuzz.token_sort_ratio(word1_processed, word2_processed, processor=None, score_cutoff=85):
            matched_pair.append((word1, word2))
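
The gain from preprocessing once comes from moving work out of the inner loop. The sketch below counts calls to a stand-in `normalize` function for both loop shapes, using only the standard library. `normalize` and the sample names are hypothetical (it is not RapidFuzz's `default_process`), and plain equality replaces the fuzzy score to keep the example small.

```python
calls = {"nested": 0, "hoisted": 0}


def normalize(text: str, counter_key: str) -> str:
    """Stand-in preprocessor: lowercase and trim; counts each call."""
    calls[counter_key] += 1
    return text.lower().strip()


names_a = ["Anna Lee ", "Bob Ray", "Cy Young"]
names_b = ["anna lee", "BOB RAY", "Dee Dee", "cy young"]

# Shape 1: normalize inside the inner loop, so both strings are
# re-processed for every pair (2 * len(a) * len(b) calls).
nested_matches = [
    (a, b)
    for a in names_a
    for b in names_b
    if normalize(a, "nested") == normalize(b, "nested")
]

# Shape 2: normalize each string once up front (len(a) + len(b) calls),
# then compare the cached results inside the loop.
a_norm = [normalize(a, "hoisted") for a in names_a]
b_norm = [normalize(b, "hoisted") for b in names_b]
hoisted_matches = [
    (a, b)
    for a, a_n in zip(names_a, a_norm)
    for b, b_n in zip(names_b, b_norm)
    if a_n == b_n
]
```

Both shapes find the same pairs, but the first makes 24 preprocessing calls and the second makes 7. With 10k and 1M rows, the same shift takes preprocessing from roughly 2 × 10^10 calls to about 1.01 × 10^6. The number of pairwise comparisons is unchanged, so the loop itself is still O(n·m).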

Solution 1
1 2020-04-29 11:15:51