POS-Tagger is very slow

Question

I am using nltk to generate n-grams from sentences after first removing given stop words. However, nltk.pos_tag() is extremely slow, taking up to 0.6 seconds per sentence on my CPU (Intel i7).

The output:

['The first time I went, and was completely taken by the live jazz band and atmosphere, I ordered the Lobster Cobb Salad.']
0.620481014252
["It's simply the best meal in NYC."]
0.640982151031
['You cannot go wrong at the Red Eye Grill.']
0.644664049149

The code:

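import time

import nltk
from nltk import ngrams, word_tokenize

# source, stop_words and n are defined elsewhere
for sentence in source:

    nltk_ngrams = None

    if stop_words is not None:
        start = time.time()
        sentence_pos = nltk.pos_tag(word_tokenize(sentence))
        print(time.time() - start)

        # keep only the words whose POS tag is not listed in stop_words
        filtered_words = [word for (word, pos) in sentence_pos if pos not in stop_words]
    else:
        filtered_words = ngrams(sentence.split(), n)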

Is it really this slow, or am I doing something wrong here?

Answer 1

Use pos_tag_sents for tagging multiple sentences:

>>> import time
>>> from nltk.corpus import brown
>>> from nltk import pos_tag
>>> from nltk import pos_tag_sents
>>> sents = brown.sents()[:10]
>>> start = time.time(); pos_tag(sents[0]); print(time.time() - start)
0.934092998505
>>> start = time.time(); [pos_tag(s) for s in sents]; print(time.time() - start)
9.5061340332
>>> start = time.time(); pos_tag_sents(sents); print(time.time() - start)
0.939551115036
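
Applied to raw sentence strings like the ones in the question, the same idea could look roughly like the sketch below. The helper name tag_sentences_in_batch and its tokenize and tag_sents parameters are made up for illustration; by default it falls back to NLTK's word_tokenize and pos_tag_sents, which assumes NLTK and its tokenizer and tagger data are installed.

```python
def tag_sentences_in_batch(sentences, tokenize=None, tag_sents=None):
    """Tokenize every sentence first, then tag them all with a single call.

    tokenize and tag_sents are hypothetical hooks; when omitted, NLTK's
    word_tokenize and pos_tag_sents are used.
    """
    if tokenize is None or tag_sents is None:
        # Lazy import: requires NLTK plus its tokenizer and tagger data.
        from nltk import pos_tag_sents, word_tokenize
        tokenize = tokenize or word_tokenize
        tag_sents = tag_sents or pos_tag_sents

    # pos_tag_sents expects a list of token lists, not raw strings.
    tokenized = [tokenize(sentence) for sentence in sentences]

    # One call for the whole batch, so the tagger is set up only once.
    return tag_sents(tokenized)
```

Calling tag_sentences_in_batch(source) once before the loop and then iterating over its result would replace the per-sentence nltk.pos_tag() call in the question.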

Answer 2

NLTK's pos_tag is defined as:

from nltk.tag.perceptron import PerceptronTagger

def pos_tag(tokens, tagset=None):
    tagger = PerceptronTagger()
    return _pos_tag(tokens, tagset, tagger)

So each call to pos_tag instantiates a new PerceptronTagger, which accounts for much of the computation time. You can save this time by creating the tagger once and calling tagger.tag directly:

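from nltk.tag.perceptron import PerceptronTagger
from nltk.tokenize import word_tokenize

# create the tagger once, outside any loop over sentences
tagger = PerceptronTagger()
sentence_pos = tagger.tag(word_tokenize(sentence))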
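
Put back into the loop from the question, this might look like the following minimal sketch. The names make_tagger, filter_sentences and excluded_tags are hypothetical, and the helper accepts any object with a tag(tokens) method, so the PerceptronTagger is only one possible choice.

```python
def make_tagger():
    """Build the tagger once; constructing it is the expensive step."""
    # Lazy import: requires NLTK and its perceptron tagger model.
    from nltk.tag.perceptron import PerceptronTagger
    return PerceptronTagger()


def filter_sentences(sentences, tagger, tokenize, excluded_tags):
    """Tag each sentence with one shared tagger and drop excluded POS tags."""
    filtered = []
    for sentence in sentences:
        # Reuse the same tagger instance instead of calling nltk.pos_tag().
        tagged = tagger.tag(tokenize(sentence))
        filtered.append([word for word, tag in tagged if tag not in excluded_tags])
    return filtered
```

With NLTK installed, filter_sentences(source, make_tagger(), word_tokenize, stop_words) would build the tagger a single time and reuse it for every sentence.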

Answer 3

If you are looking for another fast POS tagger in Python, you might want to try RDRPOSTagger. For example, on English POS tagging, its single-threaded Python implementation tags about 8K words per second on a 2.4 GHz Core 2 Duo machine. You can get faster tagging by using the multi-threaded mode. RDRPOSTagger obtains very competitive accuracy compared with state-of-the-art taggers and now supports pre-trained models for 40 languages. See the experimental results in this paper.

Answer 1
9, accepted, 2015-11-12 16:58:33

Answer 2
5, 2016-10-04 07:58:05

Answer 3
0, 2015-11-20 07:51:48
