Python Multiprocessing of NLTK word_tokenizer - function never completes
I'm performing natural language processing with NLTK on some fairly large datasets and would like to make use of all my processor cores. The multiprocessing module seemed the way to go, and when I run the following test code I see all cores in use, but the code never completes.
Executing the same task without multiprocessing takes roughly one minute.
Python 2.7.11 on Debian.
from nltk.tokenize import word_tokenize
import io
import time
import multiprocessing as mp

def open_file(filepath):
    # open and parse file
    file = io.open(filepath, 'rU', encoding='utf-8')
    text = file.read()
    return text

def mp_word_tokenize(text_to_process):
    # word tokenize
    start_time = time.clock()
    pool = mp.Pool(processes=8)
    word_tokens = pool.map(word_tokenize, text_to_process)
    finish_time = time.clock() - start_time
    print 'Finished word_tokenize in [' + str(finish_time) + '] seconds. Generated [' + str(len(word_tokens)) + '] tokens'
    return word_tokens

filepath = "./p40_compiled.txt"
text = open_file(filepath)
tokenized_text = mp_word_tokenize(text)
This answer is now outdated. Please see https://stackoverflow.com/a/54032108/610569 instead.
Here's a cheat's way of doing multithreading, using sframe:
>>> import sframe
>>> import time
>>> from nltk import word_tokenize
>>>
>>> import urllib.request
>>> url = 'https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt'
>>> response = urllib.request.urlopen(url)
>>> data = response.read().decode('utf8')
>>>
>>> for _ in range(10):
...     start = time.time()
...     for line in data.split('\n'):
...         x = word_tokenize(line)
...     print ('word_tokenize():\t', time.time() - start)
... 
word_tokenize(): 4.058445692062378
word_tokenize(): 4.05820369720459
word_tokenize(): 4.090051174163818
word_tokenize(): 4.210559129714966
word_tokenize(): 4.17473030090332
word_tokenize(): 4.105806589126587
word_tokenize(): 4.082665681838989
word_tokenize(): 4.13646936416626
word_tokenize(): 4.185062408447266
word_tokenize(): 4.085020065307617
>>> sf = sframe.SFrame(data.split('\n'))
>>> for _ in range(10):
...     start = time.time()
...     x = list(sf.apply(lambda x: word_tokenize(x['X1'])))
...     print ('word_tokenize() with sframe:\t', time.time() - start)
... 
word_tokenize() with sframe: 7.174573659896851
word_tokenize() with sframe: 5.072867393493652
word_tokenize() with sframe: 5.129574775695801
word_tokenize() with sframe: 5.10952091217041
word_tokenize() with sframe: 5.015898942947388
word_tokenize() with sframe: 5.037845611572266
word_tokenize() with sframe: 5.015375852584839
word_tokenize() with sframe: 5.016635894775391
word_tokenize() with sframe: 5.155989170074463
word_tokenize() with sframe: 5.132697105407715
>>> for _ in range(10):
...     start = time.time()
...     x = [word_tokenize(line) for line in data.split('\n')]
...     print ('word_tokenize() in list comp:\t', time.time() - start)
... 
word_tokenize() in list comp: 4.176181793212891
word_tokenize() in list comp: 4.116339921951294
word_tokenize() in list comp: 4.1104896068573
word_tokenize() in list comp: 4.140819549560547
word_tokenize() in list comp: 4.103625774383545
word_tokenize() in list comp: 4.125757694244385
word_tokenize() in list comp: 4.10755729675293
word_tokenize() in list comp: 4.177418947219849
word_tokenize() in list comp: 4.11145281791687
word_tokenize() in list comp: 4.140623092651367
Note that the speed difference might be because I had other processes running on the other cores. But given a much larger dataset and dedicated cores, you can really see this scale.
It's been two years, and SFrame has since moved on to become part of turicreate.
Moreover, with the new SFrame (in Python 3) there is a nice speed-up.
from nltk import word_tokenize
from turicreate import SFrame
import time
import urllib.request

url = 'https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt'
response = urllib.request.urlopen(url)
data = response.read().decode('utf8')
lines = data.split('\n')

%%time
for _ in range(10):
    start = time.time()
    for line in lines:
        x = word_tokenize(line)
    print ('word_tokenize():\t', time.time() - start)
[out]:
word_tokenize(): 4.619681119918823
word_tokenize(): 4.666991233825684
word_tokenize(): 4.452856779098511
word_tokenize(): 4.574898958206177
word_tokenize(): 4.536381959915161
word_tokenize(): 4.522706031799316
word_tokenize(): 4.742286682128906
word_tokenize(): 4.894973039627075
word_tokenize(): 4.813692808151245
word_tokenize(): 4.663335800170898
CPU times: user 44.9 s, sys: 330 ms, total: 45.2 s
Wall time: 46.5 s
sf = SFrame(data.split('\n'))
sf.materialize()  # reads the data fully first

%%time
for _ in range(10):
    start = time.time()
    x = list(sf.apply(lambda x: word_tokenize(x['X1'])))
    print ('word_tokenize() with sframe:\t', time.time() - start)
[out]:
word_tokenize() with sframe: 3.2141151428222656
word_tokenize() with sframe: 3.129708766937256
word_tokenize() with sframe: 3.415634870529175
word_tokenize() with sframe: 3.433109760284424
word_tokenize() with sframe: 3.2390329837799072
word_tokenize() with sframe: 3.236827850341797
word_tokenize() with sframe: 3.3200089931488037
word_tokenize() with sframe: 3.367327928543091
word_tokenize() with sframe: 4.476067066192627
word_tokenize() with sframe: 4.064741134643555
CPU times: user 6.26 s, sys: 471 ms, total: 6.73 s
Wall time: 34.9 s
Note: SFrame is lazily evaluated; .materialize() forces the persistence of the SFrame to disk, committing all lazily evaluated operations.
Additionally, you can use the "embarrassingly simple" parallelization with joblib:
from joblib import Parallel, delayed

%%time
for _ in range(10):
    start = time.time()
    x = Parallel(n_jobs=4)(delayed(word_tokenize)(line) for line in lines)
    print ('word_tokenize() with joblib:\t', time.time() - start)
[out]:
word_tokenize() with joblib: 3.009906053543091
word_tokenize() with joblib: 4.92037296295166
word_tokenize() with joblib: 3.3748512268066406
word_tokenize() with joblib: 3.9530580043792725
word_tokenize() with joblib: 4.794445991516113
word_tokenize() with joblib: 3.7257909774780273
word_tokenize() with joblib: 4.811202049255371
word_tokenize() with joblib: 3.9719762802124023
word_tokenize() with joblib: 4.347040891647339
word_tokenize() with joblib: 3.958757162094116
CPU times: user 5.53 s, sys: 1.35 s, total: 6.88 s
Wall time: 40.9 s