Applying a custom function to a large list takes a long time

Question

I have a list of 48,000 words, and I am trying to group each word with up to 4 of its closest matches (fewer if that many do not exist). I am using the difflib module for this.

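I can think of two ways to do this: use difflib.get_close_matches() to get the 4 closest matches, or build the Cartesian product of the word list with itself and compute a score for each tuple in the product.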

I have code that works for smaller lists, but as the list grows (48k in my case) it takes a huge amount of time. I am looking for a scalable solution to this problem.

Code to reproduce such a list:

import difflib, itertools, random, string
from functools import partial

N = 10  # number of words to generate
random.seed(123)
# N random lowercase words of length 5
words = [''.join(random.choice(string.ascii_lowercase) for _ in range(5)) for _ in range(N)]

My attempts:

1: A function that returns the score for each tuple of the Cartesian product. After this I can group by the first element and take the top n as needed.

# Similarity score (between 0 and 1) for one (word_a, word_b) tuple
def fun(x):
    return difflib.SequenceMatcher(None, *x).ratio()
products = list(itertools.product(words, words))  # all N*N ordered pairs
scores = list(map(fun, products))
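
The grouping step mentioned in attempt 1 (group by the first element, keep the top n) is not shown in the question. A minimal sketch of one way it could be done, assuming a hypothetical helper named `top_n_per_word` and assuming a word should not be matched with itself:

```python
import difflib
import heapq
import itertools
from collections import defaultdict


def top_n_per_word(words, n=4):
    """Hypothetical helper: for each word, keep the n highest-scoring other words."""
    scored = defaultdict(list)
    # Score every ordered pair; this is still N*N comparisons.
    for a, b in itertools.product(words, repeat=2):
        if a != b:
            score = difflib.SequenceMatcher(None, a, b).ratio()
            scored[a].append((score, b))
    # heapq.nlargest returns the n best (score, word) tuples, best first.
    return {w: [b for _, b in heapq.nlargest(n, pairs)] for w, pairs in scored.items()}


groups = top_n_per_word(["apple", "apply", "ample", "maple", "zzzzz"], n=2)
```

This only tidies up the post-processing; the number of comparisons is unchanged, so it does not address the scaling problem by itself.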

2: A function that directly returns the best n (4) matches

f = partial(difflib.get_close_matches, possibilities=words, n=4, cutoff=0.4)
matches = list(map(f, words))  # up to 4 matches per word, if present

This is also the expected output.

Both work for a small list, but as the list grows they take a very long time. So I tried multiprocessing:

Multiprocessing attempt 1:

I saved the function from attempt 1 (fun) in a .py file and then imported it

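import multiprocessing
import fun  # fun.py holds the fun() function from attempt 1

if __name__ == '__main__':
    # Create the pool inside the guard so worker processes do not re-create it on import
    pool = multiprocessing.Pool(8)
    # products is the same Cartesian product as in attempt 1
    score_mlt_pr = pool.map(fun.fun, products)
    scores_mlt = list(score_mlt_pr)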

Multiprocessing attempt 2:

Using the same f as in attempt 2 above, but with the pool:

close_matches = list(pool.map(f,words))

With multiprocessing the time went down, but for 1000*48000 word combinations it still takes about 1 hour.
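
To put that figure in perspective, here is a rough extrapolation to the full 48000*48000 product. It assumes the running time scales linearly with the number of comparisons, which is an assumption and not something measured in the question:

```python
# Figure quoted above: about 1 hour for 1000 * 48000 comparisons.
measured_pairs = 1000 * 48000
measured_hours = 1.0
full_pairs = 48000 * 48000

# Linear scaling in the number of comparisons (assumption).
estimated_hours = measured_hours * full_pairs / measured_pairs
print(estimated_hours)  # 48.0
```

Adding processes divides this by at most the number of cores, so the quadratic number of comparisons remains the main cost.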

I hope I have given a clear example of my problem. Please advise how I can speed up my code.

Answer 1

This approach should perform better.

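import difflib

# wordlist is a placeholder for your own list of words; copied because the loop empties it
words = list(wordlist)
res = []
while len(words) > 4:
    # take a word from the list
    word = words.pop()
    # find the three words closest to it
    closest = difflib.get_close_matches(word, possibilities=words, n=3, cutoff=0.4)
    # remove the found words from the list
    for w in closest:
        words.remove(w)
    # add the popped word as the fourth member of the group
    closest.append(word)
    res.append(closest)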

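Your approach returns as many groups of 4 words as there are words in the original list, but it is likely that many of those groups contain the same four words. In my approach each word appears only once across all the lists. So with 1000 words you get about 250 lists of four words.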
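
The "each word appears only once" property can be checked directly. A minimal sketch, assuming the loop above is wrapped in a hypothetical function `group_words` and run on a made-up list of unique random words:

```python
import difflib
import random
import string


def group_words(words, cutoff=0.4):
    """Greedy grouping as in the answer above; works on a copy of the input."""
    remaining = list(words)
    res = []
    while len(remaining) > 4:
        word = remaining.pop()
        closest = difflib.get_close_matches(word, remaining, n=3, cutoff=cutoff)
        for w in closest:
            remaining.remove(w)
        closest.append(word)
        res.append(closest)
    return res, remaining  # remaining holds at most 4 ungrouped words


random.seed(123)
candidates = ("".join(random.choice(string.ascii_lowercase) for _ in range(5)) for _ in range(100))
words = list(dict.fromkeys(candidates))  # drop duplicates, keep order

groups, leftover = group_words(words)
flat = [w for g in groups for w in g] + leftover
print(sorted(flat) == sorted(words))  # True: every word is used exactly once
```

Groups can hold fewer than four words when fewer than three candidates pass the cutoff, so the group count is only approximately N/4.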

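I tested your second method with a list of 500 words and a list of 1000 words. The 500-word run took 1.93796 seconds and the 1000-word run took 7.75168 seconds. So the time grows quadratically: doubling N makes it run almost 4 times slower.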

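With my approach, the 500-word list took 0.2435 seconds and the 1000-word list took 0.94891 seconds. Going by these figures, doubling N costs about 3.9 times as long (0.94891 / 0.2435), so the growth is still roughly quadratic, but each run is about 8 times faster than with your method. This is expected, because there are fewer iterations (N/4 vs N) and get_close_matches probably runs faster when it has fewer possibilities to check.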
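
As a quick check of the scaling, the ratios implied by the timings quoted above can be computed directly. The variable names are illustrative only:

```python
# Timings in seconds, as quoted in this answer.
question_500, question_1000 = 1.93796, 7.75168
answer_500, answer_1000 = 0.2435, 0.94891

question_growth = round(question_1000 / question_500, 2)  # cost of doubling N, question's method
answer_growth = round(answer_1000 / answer_500, 2)        # cost of doubling N, this answer's method
speedup_500 = round(question_500 / answer_500, 2)         # how much faster at N = 500

print(question_growth, answer_growth, speedup_500)  # 4.0 3.9 7.96
```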

- - Edit - -

If you need a dictionary with every word in the list as a key, you can do this

res = {}
while len(words) > 4:
    # take a word from the list
    word = words.pop()
    # find the three words closest to it
    closest = difflib.get_close_matches(word, possibilities=words, n=3, cutoff=0.4)
    # remove the found words from the list
    for w in closest:
        words.remove(w)
    # add the popped word as the fourth member of the group
    closest.append(word)
    # map every member of the group to the (shared) group list
    for w in closest:
        res[w] = closest
# if there are leftover words, add them to the result as one final group
for w in words:
    res[w] = words

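Now res has as many entries as there are unique words in the list. The only issue is "data quality". As the list shrinks during the iteration, get_close_matches has fewer and fewer candidates to pick matches from, so in the last rounds it may not find the best match for a word. On the other hand, this method is as fast as the previous one.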

Whether the result is acceptable depends on where you use this data.

Solution 1
1 2020-04-24 16:38:49