
POS-Tagger is incredibly slow

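I am using nltk to generate n-grams from sentences, after first removing given stop words. However, nltk.pos_tag() is extremely slow, taking about 0.6 seconds per sentence on my CPU (Intel i7).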

Output:

['The first time I went, and was completely taken by the live jazz band and atmosphere, I ordered the Lobster Cobb Salad.']
0.620481014252
["It's simply the best meal in NYC."]
0.640982151031
['You cannot go wrong at the Red Eye Grill.']
0.644664049149

Code:

for sentence in source:

    nltk_ngrams = None

    if stop_words is not None:   
        start = time.time()
        sentence_pos = nltk.pos_tag(word_tokenize(sentence))
        print time.time() - start

        filtered_words = [word for (word, pos) in sentence_pos if pos not in stop_words]
    else:
        filtered_words = ngrams(sentence.split(), n)

Is it really this slow, or am I doing something wrong here?

Use pos_tag_sents to tag multiple sentences:

>>> import time
>>> from nltk.corpus import brown
>>> from nltk import pos_tag
>>> from nltk import pos_tag_sents
>>> sents = brown.sents()[:10]
>>> start = time.time(); pos_tag(sents[0]); print time.time() - start
0.934092998505
>>> start = time.time(); [pos_tag(s) for s in sents]; print time.time() - start
9.5061340332
>>> start = time.time(); pos_tag_sents(sents); print time.time() - start 
0.939551115036
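
Applied to the loop in the question, the same idea might look like the sketch below: tokenize every sentence first, tag them all in one pos_tag_sents call, then filter by tag. The function names filter_tagged and tag_and_filter and the parameter excluded_tags are hypothetical, chosen for illustration; the sketch assumes NLTK and its tokenizer and tagger models are installed.

```python
def filter_tagged(tagged_sents, excluded_tags):
    """Drop every (word, tag) pair whose tag is in excluded_tags."""
    return [
        [word for (word, tag) in sent if tag not in excluded_tags]
        for sent in tagged_sents
    ]


def tag_and_filter(sentences, excluded_tags):
    """Tag all sentences in one batch, then filter out unwanted tags."""
    # Imported lazily so filter_tagged stays usable without NLTK
    # (assumption: NLTK plus its tokenizer/tagger data are installed).
    from nltk import pos_tag_sents, word_tokenize

    tokenized = [word_tokenize(sentence) for sentence in sentences]
    # A single call for the whole batch instead of one pos_tag per sentence.
    return filter_tagged(pos_tag_sents(tokenized), excluded_tags)
```

Keeping the filtering step separate from the tagging step means the filter can be checked on hand-written (word, tag) pairs without loading the model.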

nltk pos_tag is defined as:

from nltk.tag.perceptron import PerceptronTagger
def pos_tag(tokens, tagset=None):
    tagger = PerceptronTagger()
    return _pos_tag(tokens, tagset, tagger)

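So each call to pos_tag instantiates the perceptron tagger, which accounts for much of the computation time. You can save this time by creating the tagger once and calling tagger.tag directly: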

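from nltk.tokenize import word_tokenize
from nltk.tag.perceptron import PerceptronTagger

# Load the model once, outside the loop, and reuse it for every sentence.
tagger = PerceptronTagger()
sentence_pos = tagger.tag(word_tokenize(sentence))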
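
If the tagger is needed from several places, one way to keep the "instantiate once" rule is to put the construction behind a cached factory. This is only a sketch of that pattern, not part of NLTK: cached_factory, load_perceptron_tagger, get_tagger and tag_tokens are hypothetical names, and it assumes NLTK and its perceptron model are installed when get_tagger is actually called.

```python
import functools


def cached_factory(factory):
    """Wrap a zero-argument factory so the object is built only once."""
    return functools.lru_cache(maxsize=1)(factory)


def load_perceptron_tagger():
    """Build the NLTK perceptron tagger (the expensive step)."""
    # Lazy import; assumption: NLTK and its tagger model are installed.
    from nltk.tag.perceptron import PerceptronTagger
    return PerceptronTagger()


# The first call loads the model; later calls return the same instance.
get_tagger = cached_factory(load_perceptron_tagger)


def tag_tokens(tokens, tagger_source=get_tagger):
    """Tag one tokenized sentence using the shared tagger instance."""
    return tagger_source().tag(tokens)
```

Calling tag_tokens(word_tokenize(sentence)) inside the loop then pays the model-loading cost only on the first sentence.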

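If you are looking for another POS tagger with fast tagging performance in Python, you might want to try RDRPOSTagger. For example, on English POS tagging, the single-threaded Python implementation tags about 8K words per second on a Core 2 Duo 2.4GHz machine. You can get faster tagging speed simply by using the multi-threaded mode. RDRPOSTagger obtains very competitive accuracy compared with state-of-the-art taggers and now supports pre-trained models for 40 languages. See the experimental results in this paper.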

