

The fastest way to loop through a list of 300k dictionaries?

I have a dictionary like this:

# I have a list of 300k dictionaries similar in format to the following
# I cannot assume the dictionaries are sorted by the "id" key
sentences = {"id": 1, "some_sentence" :"In general, the performance gains that indexes provide for read operations are worth the insertion penalty.",
             "another_sent": "She said, I had a dream you were playing for the Panthers.' I was like, that's weird because I'm in Indianapolis. But life has come full circle and her dream came true.”" }
# the stray quotation marks are not a typo


# the re.findall splits the sentence into individual words; punctuation
# comes out as separate one-character tokens (dropped later by the length filter)
temp = re.findall(r"[\w']+|[.,!?;]", sentences.get('some_sentence'))
temp += re.findall(r"[\w']+|[.,!?;]", sentences.get('another_sent'))
# temp = ["She", "said", "had",...]

# deduplicate case-insensitively, so if both "My" and "my" appear in the sentence only "my" is kept
words = list(set(t.lower() for t in temp))

# I need to remove words of length less than 3
for i in words:
    if len(i) < 3:
       words.remove(i)

# put the list of words back into the dict

sentences["Words"] = words  # O(1)

I have a list of 300k dictionaries, and right now it takes about 53 seconds to run on my Mac. I really don't know what else I can do to bring the time down.

Things I have tried:

  • I tried using enumerate; it made things slightly slower.
  • I tried using the cython library to compile it to C, but did not get enough of an improvement because "re.findall" cannot be translated to C.

Any ideas?

This may improve the execution time, and it will save you headaches in any case, because modifying a list while you iterate over it is a bad idea - without profiling it is hard to say whether the improvement will be trivial or significant:

Replace:

words = list(set(t.lower() for t in temp))
for i in words:
    if len(i) < 3:
       words.remove(i)   # this is an expensive operation on longer lists

with:

words = [word for word in set(t.lower() for t in temp) if len(word) >= 3]  # >= 3 keeps 3-letter words, matching the original filter
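
To check how much this buys you before touching the real data, a rough micro-benchmark along the following lines can help. This is a minimal sketch on synthetic input, not the asker's data: the 5000-word list, the word lengths, and the 100 repetitions are arbitrary choices for illustration.

import timeit

# build a synthetic word list with plenty of short words and duplicates
setup = (
    "import random, string\n"
    "random.seed(0)\n"
    "temp = [''.join(random.choices(string.ascii_letters, k=random.randint(1, 8)))\n"
    "        for _ in range(5000)]\n"
)

# the original version; note it can also skip the neighbour of a removed
# item, because the list shifts underneath the iterator
remove_in_loop = (
    "words = list(set(t.lower() for t in temp))\n"
    "for i in words:\n"
    "    if len(i) < 3:\n"
    "        words.remove(i)\n"
)

comprehension = "words = [w for w in set(t.lower() for t in temp) if len(w) >= 3]"

print("remove in loop:", timeit.timeit(remove_in_loop, setup, number=100))
print("comprehension: ", timeit.timeit(comprehension, setup, number=100))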

You might get a speedup by using multiple processes. Since you are on a Mac, child processes can see the parent's memory when they are started with fork (note that since Python 3.8 macOS defaults to spawn, so you may have to request fork explicitly). By keeping the sentence list in a global variable and passing only an index to the subprocess, you get a reasonably lean way of getting the data to the workers. Even so, the resulting word lists have to be shipped back to the parent, which may cancel out the advantage of the pool. This approach does not work on Windows, where global variables are not visible in the child process's address space.

import multiprocessing as mp
import re

# I have a list of 300k dictionaries similar in format to the following
# I cannot assume the dictionaries are sorted by the "id" key
sentences = {"id": 1, "some_sentence" :"In general, the performance gains that indexes provide for read operations are worth the insertion penalty.",
                 "another_sent": "She said, I had a dream you were playing for the Panthers.' I was like, that's weird because I'm in Indianapolis. But life has come full circle and her dream came true.”" }
# the stray quotation marks are not a typo

# imagine this is the 300k list of dicts
sentence_list = [sentences]

def worker(index):
    sentences = sentence_list[index]
    # re.findall splits the sentence into individual words; punctuation
    # becomes separate one-character tokens (dropped by the length filter below)
    temp = re.findall(r"[\w']+|[.,!?;]", sentences.get('some_sentence'))
    temp += re.findall(r"[\w']+|[.,!?;]", sentences.get('another_sent'))
    # temp = ["She", "said", "had",...]

    # deduplicate case-insensitively and drop words shorter than 3 characters,
    # using the list comprehension suggested above instead of remove() in a loop
    words = [word for word in set(t.lower() for t in temp) if len(word) >= 3]

    return index, words

if __name__ == "__main__":
    # rely on fork so the children inherit sentence_list; on macOS,
    # Python 3.8+ defaults to "spawn", so request fork explicitly
    mp.set_start_method("fork")
    with mp.Pool() as pool:
        for index, words in pool.imap_unordered(worker, range(len(sentence_list))):
            sentence_list[index]["Words"] = words
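
If shipping the word lists back to the parent eats the gains, the chunksize argument of imap_unordered batches many indices into a single message, which cuts the per-task IPC overhead. A sketch reusing worker and sentence_list from above; the value 256 is an arbitrary starting point, so profile to tune it:

# inside the __main__ guard shown above
with mp.Pool() as pool:
    # chunksize groups tasks into batches; 256 is a guess, not a measured optimum
    results = pool.imap_unordered(worker, range(len(sentence_list)), chunksize=256)
    for index, words in results:
        sentence_list[index]["Words"] = words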



 