
Python Multiprocessing of NLTK word_tokenizer - function never completes

I'm performing natural language processing using NLTK on some fairly large datasets and would like to take advantage of all my processor cores. It seems the multiprocessing module is what I'm after, and when I run the following test code I see all cores being utilized, but the code never completes.

Executing the same task, without multiprocessing, finishes in approximately one minute.

Python 2.7.11 on Debian.

from nltk.tokenize import word_tokenize
import io
import time
import multiprocessing as mp

def open_file(filepath):
    #open and parse file
    file = io.open(filepath, 'rU', encoding='utf-8')
    text = file.read()
    return text

def mp_word_tokenize(text_to_process):
    #word tokenize
    start_time = time.clock()
    pool = mp.Pool(processes=8)
    word_tokens = pool.map(word_tokenize, text_to_process)
    finish_time = time.clock() - start_time
    print 'Finished word_tokenize in [' + str(finish_time) + '] seconds. Generated [' + str(len(word_tokens)) + '] tokens'
    return word_tokens

filepath = "./p40_compiled.txt"
text = open_file(filepath)
tokenized_text = mp_word_tokenize(text)
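
A likely reason the code above never finishes: pool.map iterates over its input, and since text_to_process is a single string, word_tokenize is called once per character, so the inter-process overhead swamps the actual work. A minimal sketch of a variant that maps over lines instead (the helper name and the chunksize value are only illustrative, not from the original post):

from nltk.tokenize import word_tokenize
import io
import multiprocessing as mp

def mp_word_tokenize_lines(filepath):
    # tokenize one line per task instead of one character per task
    with io.open(filepath, 'r', encoding='utf-8') as f:
        lines = f.read().splitlines()
    pool = mp.Pool(processes=8)
    try:
        # chunksize batches many lines per worker dispatch to cut IPC overhead
        return pool.map(word_tokenize, lines, chunksize=1000)
    finally:
        pool.close()
        pool.join()

if __name__ == '__main__':
    tokens_per_line = mp_word_tokenize_lines("./p40_compiled.txt")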

DEPRECATED

This answer is outdated. Please see https://stackoverflow.com/a/54032108/610569 instead


Here's a cheater's way to do multi-threading using sframe:

>>> import sframe
>>> import time
>>> from nltk import word_tokenize
>>> 
>>> import urllib.request
>>> url = 'https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt'
>>> response = urllib.request.urlopen(url)
>>> data = response.read().decode('utf8')
>>> 
>>> for _ in range(10):
...     start = time.time()
...     for line in data.split('\n'):
...         x = word_tokenize(line)
...     print ('word_tokenize():\t', time.time() - start)
... 
word_tokenize():     4.058445692062378
word_tokenize():     4.05820369720459
word_tokenize():     4.090051174163818
word_tokenize():     4.210559129714966
word_tokenize():     4.17473030090332
word_tokenize():     4.105806589126587
word_tokenize():     4.082665681838989
word_tokenize():     4.13646936416626
word_tokenize():     4.185062408447266
word_tokenize():     4.085020065307617

>>> sf = sframe.SFrame(data.split('\n'))
>>> for _ in range(10):
...     start = time.time()
...     x = list(sf.apply(lambda x: word_tokenize(x['X1'])))
...     print ('word_tokenize() with sframe:\t', time.time() - start)
... 
word_tokenize() with sframe:     7.174573659896851
word_tokenize() with sframe:     5.072867393493652
word_tokenize() with sframe:     5.129574775695801
word_tokenize() with sframe:     5.10952091217041
word_tokenize() with sframe:     5.015898942947388
word_tokenize() with sframe:     5.037845611572266
word_tokenize() with sframe:     5.015375852584839
word_tokenize() with sframe:     5.016635894775391
word_tokenize() with sframe:     5.155989170074463
word_tokenize() with sframe:     5.132697105407715

>>> for _ in range(10):
...     start = time.time()
...     x = [word_tokenize(line) for line in data.split('\n')]
...     print ('word_tokenize() with list comprehension:\t', time.time() - start)
... 
word_tokenize() with list comprehension:     4.176181793212891
word_tokenize() with list comprehension:     4.116339921951294
word_tokenize() with list comprehension:     4.1104896068573
word_tokenize() with list comprehension:     4.140819549560547
word_tokenize() with list comprehension:     4.103625774383545
word_tokenize() with list comprehension:     4.125757694244385
word_tokenize() with list comprehension:     4.10755729675293
word_tokenize() with list comprehension:     4.177418947219849
word_tokenize() with list comprehension:     4.11145281791687
word_tokenize() with list comprehension:     4.140623092651367

Note that the speed difference might be because I have something else running on the other cores. But given a much larger dataset and dedicated cores, you can really see it scale.

It has been a couple of years and SFrame has since moved on to become part of turicreate (now installable with pip install turicreate):

And the speedup from using the new SFrame (in Python 3) is fairly significant.

In native Python and NLTK:

import time
import urllib.request

from nltk import word_tokenize
from turicreate import SFrame
url = 'https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt'
response = urllib.request.urlopen(url)
data = response.read().decode('utf8')
lines = data.split('\n')

%%time
for _ in range(10):
    start = time.time()
    for line in lines:
        x = word_tokenize(line)
    print ('word_tokenize():\t', time.time() - start)

[out]:

word_tokenize():     4.619681119918823
word_tokenize():     4.666991233825684
word_tokenize():     4.452856779098511
word_tokenize():     4.574898958206177
word_tokenize():     4.536381959915161
word_tokenize():     4.522706031799316
word_tokenize():     4.742286682128906
word_tokenize():     4.894973039627075
word_tokenize():     4.813692808151245
word_tokenize():     4.663335800170898
CPU times: user 44.9 s, sys: 330 ms, total: 45.2 s
Wall time: 46.5 s

With SFrame

sf = SFrame(data.split('\n'))
sf.materialize() # Reads data fully first

%%time

for _ in range(10):
    start = time.time()
    x = list(sf.apply(lambda x: word_tokenize(x['X1'])))
    print ('word_tokenize() with sframe:\t', time.time() - start)

[out]:

word_tokenize() with sframe:     3.2141151428222656
word_tokenize() with sframe:     3.129708766937256
word_tokenize() with sframe:     3.415634870529175
word_tokenize() with sframe:     3.433109760284424
word_tokenize() with sframe:     3.2390329837799072
word_tokenize() with sframe:     3.236827850341797
word_tokenize() with sframe:     3.3200089931488037
word_tokenize() with sframe:     3.367327928543091
word_tokenize() with sframe:     4.476067066192627
word_tokenize() with sframe:     4.064741134643555
CPU times: user 6.26 s, sys: 471 ms, total: 6.73 s
Wall time: 34.9 s

Note: SFrame is lazily evaluated; .materialize() forces persistence of the SFrame to disk, committing all lazily evaluated operations.
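
A tiny illustration of that laziness, as a sketch (assuming turicreate is installed; 'X1' is the default column name an SFrame assigns when built from a plain list, as above):

from turicreate import SFrame

sf = SFrame(['a quick test line', 'another line'])  # one column, named 'X1' by default
tokens = sf.apply(lambda row: row['X1'].split())    # lazy: builds a pipeline, no work yet
tokens.materialize()                                # forces evaluation and persists the result
print(list(tokens))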


With Joblib

Additionally, you can use joblib for "embarrassingly simple" parallelization:

from joblib import Parallel, delayed

%%time
for _ in range(10):
    start = time.time()
    x = Parallel(n_jobs=4)(delayed(word_tokenize)(line) for line in lines)
    print ('word_tokenize() with joblib:\t', time.time() - start)

[out]:

word_tokenize() with joblib:     3.009906053543091
word_tokenize() with joblib:     4.92037296295166
word_tokenize() with joblib:     3.3748512268066406
word_tokenize() with joblib:     3.9530580043792725
word_tokenize() with joblib:     4.794445991516113
word_tokenize() with joblib:     3.7257909774780273
word_tokenize() with joblib:     4.811202049255371
word_tokenize() with joblib:     3.9719762802124023
word_tokenize() with joblib:     4.347040891647339
word_tokenize() with joblib:     3.958757162094116
CPU times: user 5.53 s, sys: 1.35 s, total: 6.88 s
Wall time: 40.9 s
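
Each word_tokenize(line) call is short, so dispatch overhead matters; joblib's batch_size argument lets each worker receive many lines per dispatch. A small variation on the call above (the value 100 is only illustrative; the default 'auto' also adapts the batch size):

from joblib import Parallel, delayed
from nltk import word_tokenize

# Batch many short tasks per worker dispatch to amortize the overhead;
# assumes the same lines list prepared earlier.
x = Parallel(n_jobs=4, batch_size=100)(delayed(word_tokenize)(line) for line in lines)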
