Python Multiprocessing of NLTK word_tokenizer - function never completes
I'm performing natural language processing with NLTK on some fairly large datasets and would like to make use of all my processor cores. The multiprocessing module seemed the way to go, and when I run the following test code I see all cores in use, but the code never completes.
Executing the same task without multiprocessing takes roughly one minute.
Python 2.7.11 on Debian.
from nltk.tokenize import word_tokenize
import io
import time
import multiprocessing as mp

def open_file(filepath):
    # open and parse file
    file = io.open(filepath, 'rU', encoding='utf-8')
    text = file.read()
    return text

def mp_word_tokenize(text_to_process):
    # word tokenize
    start_time = time.clock()
    pool = mp.Pool(processes=8)
    word_tokens = pool.map(word_tokenize, text_to_process)
    finish_time = time.clock() - start_time
    print 'Finished word_tokenize in [' + str(finish_time) + '] seconds. Generated [' + str(len(word_tokens)) + '] tokens'
    return word_tokens

filepath = "./p40_compiled.txt"
text = open_file(filepath)
tokenized_text = mp_word_tokenize(text)
This answer is now outdated. Please see https://stackoverflow.com/a/54032108/610569 instead.
Here's a cheat's way of doing multithreading, using sframe:
>>> import sframe
>>> import time
>>> from nltk import word_tokenize
>>>
>>> import urllib.request
>>> url = 'https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt'
>>> response = urllib.request.urlopen(url)
>>> data = response.read().decode('utf8')
>>>
>>> for _ in range(10):
...     start = time.time()
...     for line in data.split('\n'):
...         x = word_tokenize(line)
...     print ('word_tokenize():\t', time.time() - start)
... 
word_tokenize(): 4.058445692062378
word_tokenize(): 4.05820369720459
word_tokenize(): 4.090051174163818
word_tokenize(): 4.210559129714966
word_tokenize(): 4.17473030090332
word_tokenize(): 4.105806589126587
word_tokenize(): 4.082665681838989
word_tokenize(): 4.13646936416626
word_tokenize(): 4.185062408447266
word_tokenize(): 4.085020065307617
>>> sf = sframe.SFrame(data.split('\n'))
>>> for _ in range(10):
...     start = time.time()
...     x = list(sf.apply(lambda x: word_tokenize(x['X1'])))
...     print ('word_tokenize() with sframe:\t', time.time() - start)
... 
word_tokenize() with sframe: 7.174573659896851
word_tokenize() with sframe: 5.072867393493652
word_tokenize() with sframe: 5.129574775695801
word_tokenize() with sframe: 5.10952091217041
word_tokenize() with sframe: 5.015898942947388
word_tokenize() with sframe: 5.037845611572266
word_tokenize() with sframe: 5.015375852584839
word_tokenize() with sframe: 5.016635894775391
word_tokenize() with sframe: 5.155989170074463
word_tokenize() with sframe: 5.132697105407715
>>> for _ in range(10):
...     start = time.time()
...     x = [word_tokenize(line) for line in data.split('\n')]
...     print ('word_tokenize() in list comp:\t', time.time() - start)
... 
word_tokenize() in list comp: 4.176181793212891
word_tokenize() in list comp: 4.116339921951294
word_tokenize() in list comp: 4.1104896068573
word_tokenize() in list comp: 4.140819549560547
word_tokenize() in list comp: 4.103625774383545
word_tokenize() in list comp: 4.125757694244385
word_tokenize() in list comp: 4.10755729675293
word_tokenize() in list comp: 4.177418947219849
word_tokenize() in list comp: 4.11145281791687
word_tokenize() in list comp: 4.140623092651367
Note that the speed difference might be because I had other processes running on the other cores. But given a much larger dataset and dedicated cores, you can really see this scale.
It's been two years, and SFrame has since moved on to become part of turicreate.
Moreover, with the new SFrame (in Python 3) there is a nice speed-up.
from nltk import word_tokenize
from turicreate import SFrame
import time
import urllib.request

url = 'https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt'
response = urllib.request.urlopen(url)
data = response.read().decode('utf8')
lines = data.split('\n')

%%time
for _ in range(10):
    start = time.time()
    for line in lines:
        x = word_tokenize(line)
    print ('word_tokenize():\t', time.time() - start)
[out]:
word_tokenize(): 4.619681119918823
word_tokenize(): 4.666991233825684
word_tokenize(): 4.452856779098511
word_tokenize(): 4.574898958206177
word_tokenize(): 4.536381959915161
word_tokenize(): 4.522706031799316
word_tokenize(): 4.742286682128906
word_tokenize(): 4.894973039627075
word_tokenize(): 4.813692808151245
word_tokenize(): 4.663335800170898
CPU times: user 44.9 s, sys: 330 ms, total: 45.2 s
Wall time: 46.5 s
sf = SFrame(data.split('\n'))
sf.materialize()  # reads the data fully first

%%time
for _ in range(10):
    start = time.time()
    x = list(sf.apply(lambda x: word_tokenize(x['X1'])))
    print ('word_tokenize() with sframe:\t', time.time() - start)
[out]:
word_tokenize() with sframe: 3.2141151428222656
word_tokenize() with sframe: 3.129708766937256
word_tokenize() with sframe: 3.415634870529175
word_tokenize() with sframe: 3.433109760284424
word_tokenize() with sframe: 3.2390329837799072
word_tokenize() with sframe: 3.236827850341797
word_tokenize() with sframe: 3.3200089931488037
word_tokenize() with sframe: 3.367327928543091
word_tokenize() with sframe: 4.476067066192627
word_tokenize() with sframe: 4.064741134643555
CPU times: user 6.26 s, sys: 471 ms, total: 6.73 s
Wall time: 34.9 s
Note: SFrame is lazily evaluated; .materialize() forces the persistence of the SFrame to disk, committing all lazily evaluated operations.
Additionally, you can use the "embarrassingly simple" parallelization with joblib:
from joblib import Parallel, delayed

%%time
for _ in range(10):
    start = time.time()
    x = Parallel(n_jobs=4)(delayed(word_tokenize)(line) for line in lines)
    print ('word_tokenize() with joblib:\t', time.time() - start)
[out]:
word_tokenize() with joblib: 3.009906053543091
word_tokenize() with joblib: 4.92037296295166
word_tokenize() with joblib: 3.3748512268066406
word_tokenize() with joblib: 3.9530580043792725
word_tokenize() with joblib: 4.794445991516113
word_tokenize() with joblib: 3.7257909774780273
word_tokenize() with joblib: 4.811202049255371
word_tokenize() with joblib: 3.9719762802124023
word_tokenize() with joblib: 4.347040891647339
word_tokenize() with joblib: 3.958757162094116
CPU times: user 5.53 s, sys: 1.35 s, total: 6.88 s
Wall time: 40.9 s