
The fastest way to loop through a list of 300k dictionaries?

Each of the dicts looks like this:

# I have a list of 300k dictionaries similar in format to the following
# I cannot assume the dictionaries are sorted by the "id" key
sentences = {"id": 1, "some_sentence" :"In general, the performance gains that indexes provide for read operations are worth the insertion penalty.",
             "another_sent": "She said, I had a dream you were playing for the Panthers.' I was like, that's weird because I'm in Indianapolis. But life has come full circle and her dream came true.”" }
# the double quotations are not a typo


# the re.findall is meant to split the sentence into individual words, excluding punctuation
temp = []
temp = re.findall(r"[\w']+|[.,!?;]", sentences.get('some_sentence'))
temp += re.findall(r"[\w']+|[.,!?;]", sentences.get('another_sent'))
# temp = ["She", "said", "had",...]

# deduplicate case-insensitively, so if both "My" and "my" appear in the sentence only "my" is kept
words = list(set(t.lower() for t in temp))

# I need to remove words of length less than 3
for i in words:
    if len(i) < 3:
       words.remove(i)

# put the list of words back into the dict

sentences["Words"] = words. # O(1)

I have a list of 300k dictionaries, and right now it takes about 53 seconds to run on my Mac. I don't really know what else I can do to cut the time down from this.

things I have tried:

  • I have tried using enumerate, but it made things a little slower.
  • I have tried translating to C using the Cython library, but I did not get enough of an improvement because I could not translate the re.findall call to C.

any ideas?

This might improve the execution time, and it keeps you out of trouble, since it is not a good idea to modify a list while you are iterating over it. It is hard to say whether the improvement will be marginal or significant without profiling:

replace:

words = list(set(t.lower() for t in temp))
for i in words:
    if len(i) < 3:
       words.remove(i)   # this is an expensive operation on longer lists

with:

words = [word for word in set(t.lower() for t in temp) if len(word) >= 3]
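
If the temp list is not needed elsewhere, you could also fold the tokenizing, lowercasing and filtering into one pass and precompile the pattern. This is only a minimal sketch of that idea (the WORD_RE and extract_words names are mine), and whether it saves much time again depends on profiling:

import re

# compile the pattern once up front; the module-level re.findall has to look the
# cached pattern up on every call
WORD_RE = re.compile(r"[\w']+|[.,!?;]")

def extract_words(sentences):
    # tokenize both fields, lowercase, deduplicate and drop words shorter than 3 characters
    tokens = WORD_RE.findall(sentences["some_sentence"]) + WORD_RE.findall(sentences["another_sent"])
    return [t for t in {tok.lower() for tok in tokens} if len(t) >= 3]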

You may get a speedup using multiple processes. Since you are on a Mac, child processes get a view of the parent's memory as long as the fork start method is used (it was the default on macOS before Python 3.8). By putting the sentence list in a global variable and only passing its index to the child processes, you have a reasonably lean way to get data to the subprocesses. Still, the resulting word lists need to be passed back to the parent, and that could negate the advantage of a pool. This method doesn't work on Windows, which wouldn't see the global variable in the subprocess address space.

import multiprocessing as mp
import re

# I have a list of 300k dictionaries similar in format to the following
# I cannot assume the dictionaries are sorted by the "id" key
sentences = {"id": 1, "some_sentence" :"In general, the performance gains that indexes provide for read operations are worth the insertion penalty.",
                 "another_sent": "She said, I had a dream you were playing for the Panthers.' I was like, that's weird because I'm in Indianapolis. But life has come full circle and her dream came true.”" }
# the double quotations are not a typo

# imagine this is the 300k list of dicts
sentence_list = [sentences]

def worker(index):
    sentences = sentence_list[index]
    # the re.findall is meant to split the sentence into individual words, excluding punctuation
    temp = []
    temp = re.findall(r"[\w']+|[.,!?;]", sentences.get('some_sentence'))
    temp += re.findall(r"[\w']+|[.,!?;]", sentences.get('another_sent'))
    # temp = ["She", "said", "had",...]

    # deduplicate case-insensitively and drop words shorter than 3 characters
    # (using the list-comprehension fix from above instead of remove() in a loop)
    words = [word for word in set(t.lower() for t in temp) if len(word) >= 3]

    return index, words

with mp.Pool() as pool:
    for index, words in pool.imap_unordered(worker, range(len(sentence_list))):
        sentence_list[index]["Words"] = words
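
With 300k small work items, the per-item overhead of sending indices to the workers and results back can dominate. Passing a chunksize to imap_unordered batches the work so fewer round trips are made; the value below is only a guess and should be tuned by measuring:

with mp.Pool() as pool:
    # chunksize batches many indices per inter-process message
    for index, words in pool.imap_unordered(worker, range(len(sentence_list)), chunksize=1000):
        sentence_list[index]["Words"] = words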
