
How to use parallel processing to call the same function multiple times?

How can I make this task finish faster? Can the three calls to generate_ngrams_from_file() run in parallel? I am new to Python and don't know how to speed it up. I think multiprocessing or threading should do the job, but I have no idea how to use them. This looks like a typical task that could run concurrently across the multiple cores of my Mac.

import time
from collections import Counter

# These globals are not shown in the post; Counters are assumed here.
bigrams = Counter()
trigrams = Counter()
fourgrams = Counter()
fivegrams = Counter()

def tokenize(text):
    return text.split(' ')

def generate_ngrams(text, n):
    tokens = tokenize(text)
    # Zip n staggered copies of the token list to get sliding windows.
    ngrams = zip(*[tokens[i:] for i in range(n)])
    return [''.join(ngram) for ngram in ngrams]

def generate_ngrams_from_file(input, out, n):
    # Note: out is accepted but never written to in the code as posted.
    with open(input, 'r') as f:
        for line in f:
            if line:
                ngrams = generate_ngrams(line, n)
                if n == 2:
                    bigrams.update(ngrams)
                elif n == 3:
                    trigrams.update(ngrams)
                elif n == 4:
                    fourgrams.update(ngrams)
                elif n == 5:
                    # The post updated fourgrams here, mixing the two counts.
                    fivegrams.update(ngrams)

    print("Ngram done!")

if __name__ == "__main__":
    start = time.time()

    input_file = 'bigfile.txt'
    output_3_tram = '3gram.txt'
    output_4_tram = '4ngram.txt'
    output_5_tram = '5ngram.txt'

    print('Generate trigram: ')
    generate_ngrams_from_file(input_file, output_3_tram, 3)

    print("Generate fourgrams: ")
    generate_ngrams_from_file(input_file, output_4_tram, 4)

    print("Generate fivegrams: ")
    generate_ngrams_from_file(input_file, output_5_tram, 5)

    end = time.time()
    # mytime() is not shown in the post, so print the elapsed time directly.
    print('Elapsed: %.2f s' % (end - start))

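Multithreading is not a good fit for CPU-bound work like this in Python, because the Global Interpreter Lock (GIL) lets only one thread execute Python bytecode at a time. You can read about it here: https://www.geeksforgeeks.org/what-is-the-python-global-interpreter-lock-gil/ . Multiprocessing is the better option: each process runs its own interpreter with its own GIL, so the three calls can run on separate cores. Pass generate_ngrams_from_file() as the target of a Process from the multiprocessing module, one Process per value of n. Keep in mind that each process works on its own copy of the global counters, so the results have to be written to a file or sent back to the parent instead of being read from the parent's globals. Read about the Process class at https://docs.python.org/2/library/multiprocessing.html . With only three independent calls, using Process directly is simple enough; pool.apply() would not help here, since it blocks until each call returns and so runs the calls one after another.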
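As a rough illustration of the Process approach described above, here is a minimal sketch. It is not the asker's exact code: count_ngrams_to_file() and run_in_parallel() are hypothetical names, the tab-separated output format is an assumption (the question never shows how results are saved), and tokens are split on any whitespace and joined with a single space, which differs slightly from the question's tokenize() and ''.join(). Each process keeps its own local Counter and writes its own output file, so no state is shared between processes.

```python
import multiprocessing
from collections import Counter


def generate_ngrams(text, n):
    """Return the word n-grams of one line as space-joined strings."""
    tokens = text.split()
    return [' '.join(gram) for gram in zip(*[tokens[i:] for i in range(n)])]


def count_ngrams_to_file(input_path, output_path, n):
    """Count the n-grams in input_path and write 'ngram<TAB>count' lines."""
    # A local Counter: globals in the parent process would not be updated
    # by a child process, so each worker keeps and saves its own counts.
    counts = Counter()
    with open(input_path, 'r') as f:
        for line in f:
            counts.update(generate_ngrams(line, n))
    with open(output_path, 'w') as out:
        for ngram, count in counts.most_common():
            out.write(f'{ngram}\t{count}\n')


def run_in_parallel(input_path, jobs):
    """Start one process per (output_path, n) pair and wait for all of them."""
    processes = [
        multiprocessing.Process(target=count_ngrams_to_file,
                                args=(input_path, output_path, n))
        for output_path, n in jobs
    ]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
```

With the file names from the question, the call would be run_in_parallel('bigfile.txt', [('3gram.txt', 3), ('4ngram.txt', 4), ('5ngram.txt', 5)]). It should be made under an if __name__ == "__main__": guard, because on macOS new processes are started by re-importing the main module. Each process reads the whole input file on its own, so the speedup should be close to the number of n values only if the work is CPU-bound and not limited by disk reads.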
