Genia Tagger file not found error in Anaconda / NLTK

Question

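I need to perform text preprocessing tasks with NLTK, such as sentence splitting, tokenization, and tagging. I want to use the GENIA tagger for the tagging step. I am using Anaconda version 3.10 and installed geniatagger with the following command.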

python setup.py install

In the IPython console, I typed the following code.

import geniatagger
tagger =geniatagger.GeniaTagger('C:\Users\dell\Anaconda\geniatagger\geniatagger')
print tagger.parse('Welcome to natural language processing!')

When I press Enter, I get the following error message.

---------------------------------------------------------------------------
WindowsError                              Traceback (most recent call last)
<ipython-input-2-52e4d65c2d02> in <module>()
----> 1 tagger = geniatagger.GeniaTagger('C:\Users\dell\Anaconda\geniatagger\geniatagger')
  2 print tagger.parse('Welcome to natural language processing!')
  3 

 C:\Users\dell\Anaconda\lib\site-packages\geniatagger_python-0.1-py2.7.egg\geniatagger.pyc in __init__(self, path_to_tagger)
 19         self._tagger = subprocess.Popen('./'+os.path.basename(path_to_tagger),
 20                                         cwd=self._dir_to_tagger,
 ---> 21                                         stdin=subprocess.PIPE, stdout=subprocess.PIPE)
 22 
 23     def parse(self, text):

 C:\Users\dell\Anaconda\lib\subprocess.pyc in __init__(self, args, bufsize, executable, stdin, stdout, stderr, preexec_fn, close_fds, shell, cwd, env, universal_newlines, startupinfo, creationflags)
708                                 p2cread, p2cwrite,
709                                 c2pread, c2pwrite,
--> 710                                 errread, errwrite)
711         except Exception:
712             # Preserve original exception in case os.close raises.

C:\Users\dell\Anaconda\lib\subprocess.pyc in _execute_child(self, args, executable, preexec_fn, close_fds, cwd, env, universal_newlines, startupinfo, creationflags, shell, to_close, p2cread, p2cwrite, c2pread, c2pwrite, errread, errwrite)
956                                          env,
957                                          cwd,
--> 958                                          startupinfo)
959             except pywintypes.error, e:
960                 # Translate pywintypes.error to WindowsError, which is

WindowsError: [Error 2] The system cannot find the file specified

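Why am I getting this error message? How can I fix it?

If I use this tagger directly, will it also perform the tokenization step?

Note: the geniatagger Python files are located in the 'geniatagger' folder.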

Answer 1

TL;DR:

# Install Genia Tagger (C code).
$ git clone https://github.com/saffsd/geniatagger && cd geniatagger && make && cd ..
# Install Genia Tagger (python wrapper)
$ git clone https://github.com/informationsea/geniatagger-python.git && cd geniatagger-python && sudo python setup.py install && cd ..
$ python
>>> from geniatagger import GeniaTagger
>>> tagger = GeniaTagger('./geniatagger/geniatagger')
>>> loading morphdic...done.
loading pos_models................done.
loading chunk_models....done.
loading named_entity_models..done.

>>> print tagger.parse('This is a pen.')
[('This', 'This', 'DT', 'B-NP', 'O'), ('is', 'be', 'VBZ', 'B-VP', 'O'), ('a', 'a', 'DT', 'B-NP', 'O'), ('pen', 'pen', 'NN', 'I-NP', 'O'), ('.', '.', '.', 'O', 'O')]

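I'm not sure whether the Genia tagger package works out of the box from conda, so I think a native python / pip fix is simpler.

First, there is no support for Genia Tagger in NLTK (at least not yet =)), so this is not a problem with the NLTK installation or its modules.

The problem probably lies in some outdated includes used by the original GeniaTagger C code ( http://www.nactem.ac.uk/tsujii/GENIA/tagger/ ).

To fix it you would have to add #include <cstdlib> to the original code, but thankfully @saffsd has already done that and put it nicely in his GitHub repo ( https://github.com/saffsd/geniatagger/blob/master/morph.cpp ).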

Then, to install the Python wrapper, you can either:

Install it from the official PyPI: pip install https://pypi.python.org/packages/source/g/geniatagger-python/geniatagger-python-0.1.tar.gz
Or install it from some other GitHub repo, e.g. https://github.com/informationsea/geniatagger-python, which comes up first in a Google search

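Finally, the GeniaTagger initialization in Python is rather odd: it takes the path to the tagger executable itself rather than the path to the tagger directory, and it assumes that the model files are in the same directory as the tagger. See https://github.com/informationsea/geniatagger-python/blob/master/geniatagger.py#L19 .

It also probably expects './' at the first level of the directory path, so you have to initialize the tagger as GeniaTagger('./geniatagger/geniatagger') .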
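The path handling described above can be sketched as a small pre-flight check. The helper name `split_tagger_path` is hypothetical and not part of geniatagger-python; it only mirrors the `cwd=...` / `'./' + os.path.basename(...)` pattern visible in the traceback, and it assumes you want to fail early when the executable is missing instead of getting the `[Error 2]` from `subprocess.Popen`.

```python
import os


def split_tagger_path(path_to_tagger):
    """Hypothetical helper, not part of the geniatagger wrapper.

    Splits the argument into (working_directory, command) in the way the
    traceback suggests the wrapper does, and raises if the file is missing.
    """
    path_to_tagger = os.path.abspath(path_to_tagger)
    if not os.path.isfile(path_to_tagger):
        # The path must point at the compiled tagger binary itself,
        # not at the folder that contains it.
        raise IOError('tagger executable not found: %s' % path_to_tagger)
    directory = os.path.dirname(path_to_tagger)        # used as cwd
    command = './' + os.path.basename(path_to_tagger)  # run relative to cwd
    return directory, command
```

Calling `split_tagger_path('./geniatagger/geniatagger')` before constructing `GeniaTagger` tells you whether the path really points at a compiled binary. If it raises, the tagger was most likely never built with `make`, or the path points at the folder rather than the executable.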

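Beyond the installation issues: if you use the Python wrapper for GeniaTagger, the GeniaTagger object has only one function, parse() . It takes one sentence string as input and returns a list of tuples for that sentence, one tuple per token. The items in each tuple are: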

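Token (surface word)
Lemma (see "Stemmers vs Lemmatizers")
POS tag (looks like the Penn Treebank tagset, see "What are all possible pos tags of NLTK?")
Chunk tag, e.g. B-NP / I-NP for noun chunks (see "Output results in conll format (POS-tagging, stanford pos tagger)")
Named entity chunk tag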
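As a minimal sketch of working with that output, the five positions can be given names with a namedtuple. The field names (`word`, `lemma`, `pos`, `chunk`, `entity`) are illustrative labels chosen here, not names defined by the wrapper, and the `parsed` list is copied from the parse('This is a pen.') output shown earlier, so the snippet runs without the tagger installed.

```python
from collections import namedtuple

# Illustrative labels for the five tuple positions (not defined by the wrapper).
TaggedToken = namedtuple('TaggedToken',
                         ['word', 'lemma', 'pos', 'chunk', 'entity'])

# Output of tagger.parse('This is a pen.') as shown in the answer.
parsed = [('This', 'This', 'DT', 'B-NP', 'O'),
          ('is', 'be', 'VBZ', 'B-VP', 'O'),
          ('a', 'a', 'DT', 'B-NP', 'O'),
          ('pen', 'pen', 'NN', 'I-NP', 'O'),
          ('.', '.', '.', 'O', 'O')]

tokens = [TaggedToken(*item) for item in parsed]

# The tagger has already split the sentence into tokens.
words = [t.word for t in tokens]

# (word, POS) pairs, the shape most NLTK tagging utilities work with.
pos_tags = [(t.word, t.pos) for t in tokens]
```

Note that the final '.' comes back as its own token, so in this example the tagger has tokenized the sentence as part of parsing.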

3 2015-08-18 17:39:26