How do I use the Stanford word tokenizer in NLTK?

Question

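I am looking for a way to use the Stanford word tokenizer in NLTK. I want to use it because when I compare the output of the Stanford and NLTK word tokenizers, the results differ. I know there may be a way to use the Stanford tokenizer, just as NLTK supports the Stanford POS tagger and NER.

Is it possible to use the Stanford tokenizer without running a server?

Thanks.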

Answer 1

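Note: this solution only applies to:

NLTK v3.2.5 (v3.2.6 will have a simpler interface)
Stanford CoreNLP (version >= 2016-10-31)

First, make sure Java 8 is properly installed. If Stanford CoreNLP runs on the command line, the Stanford CoreNLP API in NLTK v3.2.5 can be used as follows.

Note: you must start the CoreNLP server in a terminal before using the new CoreNLP API in NLTK.

In the terminal: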

wget http://nlp.stanford.edu/software/stanford-corenlp-full-2016-10-31.zip
unzip stanford-corenlp-full-2016-10-31.zip && cd stanford-corenlp-full-2016-10-31

java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
-preload tokenize,ssplit,pos,lemma,parse,depparse \
-status_port 9000 -port 9000 -timeout 15000

In Python:

>>> from nltk.parse.corenlp import CoreNLPParser
>>> st = CoreNLPParser()
>>> tokenized_sent = list(st.tokenize('What is the airspeed of an unladen swallow ?'))
>>> tokenized_sent
['What', 'is', 'the', 'airspeed', 'of', 'an', 'unladen', 'swallow', '?']
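
For readers who want to see roughly what such a wrapper does, here is a minimal sketch that talks to the same server over HTTP using only the standard library. It assumes the server started above is listening on http://localhost:9000 and that it returns JSON with a `sentences` list whose `tokens` carry a `word` field. The helper names `build_url`, `extract_tokens` and `tokenize` are made up for illustration and are not part of NLTK or CoreNLP.

```python
import json
import urllib.parse
import urllib.request


def build_url(base_url="http://localhost:9000", annotators="tokenize,ssplit"):
    """Build the request URL; the server reads its settings from ?properties=..."""
    props = {"annotators": annotators, "outputFormat": "json"}
    query = urllib.parse.urlencode({"properties": json.dumps(props)})
    return base_url + "/?" + query


def extract_tokens(response):
    """Flatten the 'word' field of every token in every sentence."""
    return [tok["word"] for sent in response["sentences"] for tok in sent["tokens"]]


def tokenize(text, base_url="http://localhost:9000"):
    """POST raw text to a running CoreNLP server and return its tokens.

    Assumption: the server is already up; this call fails otherwise.
    """
    request = urllib.request.Request(build_url(base_url), data=text.encode("utf-8"))
    with urllib.request.urlopen(request, timeout=15) as reply:
        return extract_tokens(json.loads(reply.read().decode("utf-8")))
```

Calling `tokenize('What is the airspeed of an unladen swallow ?')` against the running server should give a token list like the one above. Only `tokenize` needs the server; the other two helpers are pure functions and can be checked offline.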

Answer 2

Outside of NLTK, you can use the official Python interface recently released by Stanford NLP:

Installation

cd ~
wget http://nlp.stanford.edu/software/stanford-corenlp-full-2016-10-31.zip
unzip stanford-corenlp-full-2016-10-31.zip && cd stanford-corenlp-full-2016-10-31
pip3 install -U https://github.com/stanfordnlp/python-stanford-corenlp/archive/master.zip

Set up the environment

# On Mac
export CORENLP_HOME=/Users/<username>/stanford-corenlp-full-2016-10-31/

# On linux
export CORENLP_HOME=/home/<username>/stanford-corenlp-full-2016-10-31/
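
Before starting the client it can help to confirm that `CORENLP_HOME` really points at the unzipped distribution. The following is a small sketch of such a check, not part of the `corenlp` package: the function name `find_corenlp_jars` is hypothetical, and it assumes the jars sit at the top level of the directory with names of the form `stanford-corenlp-<version>.jar`.

```python
import os
from pathlib import Path


def find_corenlp_jars(home=None):
    """Return the CoreNLP jar names found under CORENLP_HOME (or `home`)."""
    home = home or os.environ.get("CORENLP_HOME")
    if not home:
        raise RuntimeError("CORENLP_HOME is not set")
    root = Path(home).expanduser()
    if not root.is_dir():
        raise RuntimeError(f"CORENLP_HOME is not a directory: {root}")
    # Assumption: the unzipped distribution keeps its jars at the top level,
    # named like stanford-corenlp-<version>.jar.
    return sorted(p.name for p in root.glob("stanford-corenlp-*.jar"))
```

An empty list from `find_corenlp_jars()` suggests the variable points at the wrong directory.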

In Python

>>> import corenlp
>>> text = "Chris wrote a simple sentence that he parsed with Stanford CoreNLP."
>>> with corenlp.client.CoreNLPClient(annotators="tokenize ssplit".split()) as client:
...     ann = client.annotate(text)
... 
[pool-1-thread-4] INFO CoreNLP - [/0:0:0:0:0:0:0:1:55475] API call w/annotators tokenize,ssplit
Chris wrote a simple sentence that he parsed with Stanford CoreNLP.
>>> sentence = ann.sentence[0]
>>> 
>>> [token.word for token in sentence.token]
['Chris', 'wrote', 'a', 'simple', 'sentence', 'that', 'he', 'parsed', 'with', 'Stanford', 'CoreNLP', '.']
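
The example above only reads the first sentence. For text with several sentences, a small helper along these lines could collect the tokens of every sentence. It is a sketch that assumes `ann` has the same shape as above (a `sentence` sequence whose items have a `token` sequence with a `word` attribute); the name `all_tokens` is made up for illustration.

```python
def all_tokens(ann):
    """Return one list of token strings per sentence in an annotated document."""
    return [[token.word for token in sentence.token] for sentence in ann.sentence]
```

With the `ann` from above, `all_tokens(ann)` would return a list holding a single inner list, the same tokens shown for `sentence` above.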

Answer 1 (10, 2017-12-04 05:33:33)

Answer 2 (2, accepted, 2017-12-04 05:44:48)
