The fastest way to loop through a list of 300k dictionaries?
I have a dict like this:
# I have a list of 300k dictionaries similar in format to the following
# I cannot assume the dictionaries are sorted by the "id" key
import re

sentences = {"id": 1,
             "some_sentence": "In general, the performance gains that indexes provide for read operations are worth the insertion penalty.",
             "another_sent": "She said, I had a dream you were playing for the Panthers.' I was like, that's weird because I'm in Indianapolis. But life has come full circle and her dream came true.”"}
# the double quotation marks are not a typo

# re.findall splits the sentence into individual words; punctuation marks
# become separate single-character tokens (dropped later by the length filter)
temp = re.findall(r"[\w']+|[.,!?;]", sentences.get('some_sentence'))
temp += re.findall(r"[\w']+|[.,!?;]", sentences.get('another_sent'))
# temp = ["She", "said", "had", ...]

# remove duplicates case-insensitively, so if "My" and "my" are both used
# in the sentence only "my" is kept
words = list(set(t.lower() for t in temp))

# I need to remove words of length less than 3
for i in words:
    if len(i) < 3:
        words.remove(i)

# put the list of words back into the dict
sentences["Words"] = words  # O(1)
I have a list of 300k dictionaries, and right now this takes about 53 seconds to run on my Mac. I don't really know what else I can do to cut the time down from this.
Things I have tried:
Any ideas?
This might improve the execution time and keep you out of trouble, since it is not a good idea to modify the list you are iterating over. Without profiling, it is hard to say whether the improvement will be marginal or significant:
Replace:
words = list(set(t.lower() for t in temp))
for i in words:
    if len(i) < 3:
        words.remove(i)  # this is an expensive operation on longer lists
with:
words = [word for word in set(t.lower() for t in temp) if len(word) >= 3]
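For the record, the reason to avoid mutating while iterating is not just speed: list.remove() shifts the remaining items left, so the loop skips the element that slides into the freed slot, and some short words survive the filter. A small throwaway demo (the word list here is made up for illustration):

words = ["a", "an", "the", "my", "performance"]

# buggy: after remove("a") the iterator moves on to index 1, which is now
# "the", so "an" is never examined and survives the filter
for w in words:
    if len(w) < 3:
        words.remove(w)
print(words)  # ['an', 'the', 'performance']

# correct single pass: build a new list instead of mutating in place
words = ["a", "an", "the", "my", "performance"]
print([w for w in words if len(w) >= 3])  # ['the', 'performance']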
You may get a speedup using multiple processes. Since you are on a Mac, child processes started with the fork method get a copy-on-write view of the parent's memory (note that Python 3.8+ defaults to the spawn start method on macOS, so fork may need to be requested explicitly; see the sketch after the code below). By putting the sentence list in a global variable and only passing its index to the child processes, you have a reasonably lean way to get data to the subprocesses. Still, the resulting word lists need to be passed back to the parent, and that could negate the advantage of a pool. This method doesn't work on Windows, where subprocesses wouldn't see the populated global variable.
import multiprocessing as mp
import re

# I have a list of 300k dictionaries similar in format to the following
# I cannot assume the dictionaries are sorted by the "id" key
sentences = {"id": 1,
             "some_sentence": "In general, the performance gains that indexes provide for read operations are worth the insertion penalty.",
             "another_sent": "She said, I had a dream you were playing for the Panthers.' I was like, that's weird because I'm in Indianapolis. But life has come full circle and her dream came true.”"}
# the double quotation marks are not a typo

# imagine this is the 300k list of dicts
sentence_list = [sentences]

def worker(index):
    sentences = sentence_list[index]
    # re.findall splits the sentence into individual words; punctuation marks
    # become separate single-character tokens (dropped by the length filter)
    temp = re.findall(r"[\w']+|[.,!?;]", sentences.get('some_sentence'))
    temp += re.findall(r"[\w']+|[.,!?;]", sentences.get('another_sent'))
    # temp = ["She", "said", "had", ...]

    # deduplicate case-insensitively and drop words shorter than 3 characters
    # in a single pass (the same fix as in the first part of this answer)
    words = [word for word in set(t.lower() for t in temp) if len(word) >= 3]
    return index, words
if __name__ == "__main__":
    # the __main__ guard keeps the pool setup from re-running if a child
    # process re-imports the module
    with mp.Pool() as pool:
        for index, words in pool.imap_unordered(worker, range(len(sentence_list))):
            sentence_list[index]["Words"] = words
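One caveat worth knowing: since Python 3.8, multiprocessing defaults to the spawn start method on macOS, under which children re-import the module rather than inheriting the parent's live memory. Below is a minimal, self-contained sketch of forcing the fork method so the copy-on-write trick applies; the stub worker is a hypothetical stand-in for the real tokenising function above:

import multiprocessing as mp

sentence_list = [{"id": 1}]  # stand-in for the real 300k-dict list

def worker(index):
    # hypothetical stub; the real work is the tokenising shown above
    return index, ["example", "words"]

if __name__ == "__main__":
    # request fork explicitly; note that fork can be unstable on macOS when
    # system frameworks already have threads running, so test carefully
    ctx = mp.get_context("fork")
    with ctx.Pool() as pool:
        for index, words in pool.imap_unordered(worker, range(len(sentence_list))):
            sentence_list[index]["Words"] = words
    print(sentence_list)  # [{'id': 1, 'Words': ['example', 'words']}]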