
Python Multiprocessing of NLTK word_tokenizer - function never completes

I'm performing natural language processing using NLTK on some fairly large datasets and would like to take advantage of all my processor cores. Seems the multiprocessing module is what I'm after, and when I run the following test code I see all cores are being utilized, but the code never completes.

Executing the same task, without multiprocessing, finishes in approximately one minute.

Python 2.7.11 on Debian.

from nltk.tokenize import word_tokenize
import io
import time
import multiprocessing as mp

def open_file(filepath):
    #open and parse file
    file = io.open(filepath, 'rU', encoding='utf-8')
    text = file.read()
    return text

def mp_word_tokenize(text_to_process):
    #word tokenize
    start_time = time.clock()
    pool = mp.Pool(processes=8)
    word_tokens = pool.map(word_tokenize, text_to_process)
    finish_time = time.clock() - start_time
    print 'Finished word_tokenize in [' + str(finish_time) + '] seconds. Generated [' + str(len(word_tokens)) + '] tokens'
    return word_tokens

filepath = "./p40_compiled.txt"
text = open_file(filepath)
tokenized_text = mp_word_tokenize(text)

DEPRECATED

This answer is outdated. Please see https://stackoverflow.com/a/54032108/610569 instead.


Here's a cheater's way to do multi-threading using sframe:

>>> import sframe
>>> import time
>>> from nltk import word_tokenize
>>> 
>>> import urllib.request
>>> url = 'https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt'
>>> response = urllib.request.urlopen(url)
>>> data = response.read().decode('utf8')
>>> 
>>> for _ in range(10):
...     start = time.time()
...     for line in data.split('\n'):
...         x = word_tokenize(line)
...     print ('word_tokenize():\t', time.time() - start)
... 
word_tokenize():     4.058445692062378
word_tokenize():     4.05820369720459
word_tokenize():     4.090051174163818
word_tokenize():     4.210559129714966
word_tokenize():     4.17473030090332
word_tokenize():     4.105806589126587
word_tokenize():     4.082665681838989
word_tokenize():     4.13646936416626
word_tokenize():     4.185062408447266
word_tokenize():     4.085020065307617

>>> sf = sframe.SFrame(data.split('\n'))
>>> for _ in range(10):
...     start = time.time()
...     x = list(sf.apply(lambda x: word_tokenize(x['X1'])))
...     print ('word_tokenize() with sframe:\t', time.time() - start)
... 
word_tokenize() with sframe:     7.174573659896851
word_tokenize() with sframe:     5.072867393493652
word_tokenize() with sframe:     5.129574775695801
word_tokenize() with sframe:     5.10952091217041
word_tokenize() with sframe:     5.015898942947388
word_tokenize() with sframe:     5.037845611572266
word_tokenize() with sframe:     5.015375852584839
word_tokenize() with sframe:     5.016635894775391
word_tokenize() with sframe:     5.155989170074463
word_tokenize() with sframe:     5.132697105407715

>>> # NB: despite the 'str.split():' label printed below, this block actually
>>> # times word_tokenize() over the pre-split lines (a single-process baseline).
>>> for _ in range(10):
...     start = time.time()
...     x = [word_tokenize(line) for line in data.split('\n')]
...     print ('str.split():\t', time.time() - start)
... 
str.split():     4.176181793212891
str.split():     4.116339921951294
str.split():     4.1104896068573
str.split():     4.140819549560547
str.split():     4.103625774383545
str.split():     4.125757694244385
str.split():     4.10755729675293
str.split():     4.177418947219849
str.split():     4.11145281791687
str.split():     4.140623092651367

Note that the speed difference might be because I have something else running on the other cores. But given a much larger dataset and dedicated cores, you can really see how this scales.

It has been a couple of years, and SFrame has since moved on to become part of turicreate:

And the speed-up from using the new SFrame (in Python 3) is fairly significant.
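If you still have code written against the old standalone sframe package, the migration is essentially a package rename (a minimal sketch, assuming turicreate is installed, e.g. via pip install turicreate):

# Old standalone package (as in the deprecated snippet above):
#   import sframe
#   sf = sframe.SFrame(data.split('\n'))

# SFrame now lives inside turicreate; the calls used here stay the same:
from turicreate import SFrame
sf = SFrame(data.split('\n'))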

In native Python and NLTK:

from nltk import word_tokenize
from turicreate import SFrame

import time
import urllib.request

url = 'https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt'
response = urllib.request.urlopen(url)
data = response.read().decode('utf8')
lines = data.split('\n')

%%time
for _ in range(10):
    start = time.time()
    for line in lines:
        x = word_tokenize(line)
    print ('word_tokenize():\t', time.time() - start)

[out]:

word_tokenize():     4.619681119918823
word_tokenize():     4.666991233825684
word_tokenize():     4.452856779098511
word_tokenize():     4.574898958206177
word_tokenize():     4.536381959915161
word_tokenize():     4.522706031799316
word_tokenize():     4.742286682128906
word_tokenize():     4.894973039627075
word_tokenize():     4.813692808151245
word_tokenize():     4.663335800170898
CPU times: user 44.9 s, sys: 330 ms, total: 45.2 s
Wall time: 46.5 s

With SFrame

sf = SFrame(data.split('\n'))
sf.materialize() # Reads data fully first

%%time

for _ in range(10):
    start = time.time()
    x = list(sf.apply(lambda x: word_tokenize(x['X1'])))
    print ('word_tokenize() with sframe:\t', time.time() - start)

[out]:

word_tokenize() with sframe:     3.2141151428222656
word_tokenize() with sframe:     3.129708766937256
word_tokenize() with sframe:     3.415634870529175
word_tokenize() with sframe:     3.433109760284424
word_tokenize() with sframe:     3.2390329837799072
word_tokenize() with sframe:     3.236827850341797
word_tokenize() with sframe:     3.3200089931488037
word_tokenize() with sframe:     3.367327928543091
word_tokenize() with sframe:     4.476067066192627
word_tokenize() with sframe:     4.064741134643555
CPU times: user 6.26 s, sys: 471 ms, total: 6.73 s
Wall time: 34.9 s

Note: SFrame is lazily evaluated; .materialize() forces the persistence of the SFrame to disk, committing all lazily evaluated operations.
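As a minimal sketch of what that means in practice (assuming turicreate is installed; the 'X1' column name is what SFrame auto-assigns when it is built from a plain list, as in the benchmarks above):

from turicreate import SFrame
from nltk import word_tokenize

# Constructing from a list records the data lazily; the single column is
# auto-named 'X1'.
sf = SFrame(['a quick brown fox', 'jumps over the lazy dog'])

# materialize() commits the pending lazy operations and persists the SFrame,
# so later steps start from fully loaded data.
sf.materialize()

# apply() is itself lazy and returns an SArray; wrapping it in list() is what
# actually runs word_tokenize over every row.
tokens = list(sf.apply(lambda row: word_tokenize(row['X1'])))
print(tokens)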


With Joblib

Additionally, you can use the "embarrassingly simple" parallelization from joblib:

from joblib import Parallel, delayed

%%time
for _ in range(10):
    start = time.time()
    x = Parallel(n_jobs=4)(delayed(word_tokenize)(line) for line in lines)
    print ('word_tokenize() with joblib:\t', time.time() - start)

[out]:

word_tokenize() with joblib:     3.009906053543091
word_tokenize() with joblib:     4.92037296295166
word_tokenize() with joblib:     3.3748512268066406
word_tokenize() with joblib:     3.9530580043792725
word_tokenize() with joblib:     4.794445991516113
word_tokenize() with joblib:     3.7257909774780273
word_tokenize() with joblib:     4.811202049255371
word_tokenize() with joblib:     3.9719762802124023
word_tokenize() with joblib:     4.347040891647339
word_tokenize() with joblib:     3.958757162094116
CPU times: user 5.53 s, sys: 1.35 s, total: 6.88 s
Wall time: 40.9 s
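One caveat when parallelizing many tiny calls such as tokenizing single lines: per-task dispatch overhead can eat much of the gain. A hedged variant using Parallel's batch_size parameter to hand each worker larger chunks of lines:

from joblib import Parallel, delayed
from nltk import word_tokenize

# Larger batches let each worker amortize serialization/IPC overhead
# over many word_tokenize calls instead of paying it per line.
x = Parallel(n_jobs=4, batch_size=128)(
    delayed(word_tokenize)(line) for line in lines
)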
