Genia Tagger file not found error in Anaconda / NLTK

Question

I need to perform text pre-processing tasks such as sentence splitting, tokenization and tagging using NLTK, and I want to use the GENIA tagger for the tagging step. I am using Anaconda version 3.10 and installed geniatagger with the following command.

python setup.py install

In the IPython console, I entered the following code.

import geniatagger
tagger =geniatagger.GeniaTagger('C:\Users\dell\Anaconda\geniatagger\geniatagger')
print tagger.parse('Welcome to natural language processing!')

The following error message appears when I press Enter.

---------------------------------------------------------------------------
WindowsError                              Traceback (most recent call last)
<ipython-input-2-52e4d65c2d02> in <module>()
----> 1 tagger = geniatagger.GeniaTagger('C:\Users\dell\Anaconda\geniatagger\geniatagger')
  2 print tagger.parse('Welcome to natural language processing!')
  3 

 C:\Users\dell\Anaconda\lib\site-packages\geniatagger_python-0.1-py2.7.egg\geniatagger.pyc in __init__(self, path_to_tagger)
 19         self._tagger = subprocess.Popen('./'+os.path.basename(path_to_tagger),
 20                                         cwd=self._dir_to_tagger,
 ---> 21                                         stdin=subprocess.PIPE, stdout=subprocess.PIPE)
 22 
 23     def parse(self, text):

 C:\Users\dell\Anaconda\lib\subprocess.pyc in __init__(self, args, bufsize, executable, stdin, stdout, stderr, preexec_fn, close_fds, shell, cwd, env, universal_newlines, startupinfo, creationflags)
708                                 p2cread, p2cwrite,
709                                 c2pread, c2pwrite,
--> 710                                 errread, errwrite)
711         except Exception:
712             # Preserve original exception in case os.close raises.

C:\Users\dell\Anaconda\lib\subprocess.pyc in _execute_child(self, args, executable, preexec_fn, close_fds, cwd, env, universal_newlines, startupinfo, creationflags, shell, to_close, p2cread, p2cwrite, c2pread, c2pwrite, errread, errwrite)
956                                          env,
957                                          cwd,
--> 958                                          startupinfo)
959             except pywintypes.error, e:
960                 # Translate pywintypes.error to WindowsError, which is

WindowsError: [Error 2] The system cannot find the file specified

Why do I get this error message, and how can I fix it?

If I use this tagger straight away, will it perform the tokenization part as well?

Note: the geniatagger Python file is inside the 'geniatagger' folder.

Answer 1

TL;DR:

# Install Genia Tagger (C code).
$ git clone https://github.com/saffsd/geniatagger && cd geniatagger && make && cd ..
# Install Genia Tagger (python wrapper)
$ git clone https://github.com/informationsea/geniatagger-python.git && cd geniatagger-python && sudo python setup.py install && cd ..
$ python
>>> from geniatagger import GeniaTagger
>>> tagger = GeniaTagger('./geniatagger/geniatagger')
>>> loading morphdic...done.
loading pos_models................done.
loading chunk_models....done.
loading named_entity_models..done.

>>> print tagger.parse('This is a pen.')
[('This', 'This', 'DT', 'B-NP', 'O'), ('is', 'be', 'VBZ', 'B-VP', 'O'), ('a', 'a', 'DT', 'B-NP', 'O'), ('pen', 'pen', 'NN', 'I-NP', 'O'), ('.', '.', '.', 'O', 'O')]

I'm not sure whether the packages for Genia Tagger work out of the box from conda, so I think a native python/pip fix is simpler.

Firstly, there is no support for Genia Tagger in NLTK (at least not yet), so this isn't a problem with the NLTK installation or its modules.

The problem might lie in a missing include in the original GeniaTagger source code ( http://www.nactem.ac.uk/tsujii/GENIA/tagger/ ).

To resolve that, you have to add #include <cstdlib> to the original code. Thankfully, @saffsd has already done so and published the result in his GitHub repo ( https://github.com/saffsd/geniatagger/blob/master/morph.cpp ).

Next comes installing the Python wrapper. You can either:

install from the official PyPI with: pip install https://pypi.python.org/packages/source/g/geniatagger-python/geniatagger-python-0.1.tar.gz
or install from a GitHub repo, e.g. https://github.com/informationsea/geniatagger-python , which is the first result in a Google search

Lastly, the GeniaTagger initialization in the Python wrapper is a little odd: it takes the path to the tagger executable itself rather than the directory that contains it, and it assumes that the model files are in the same directory as the executable. See https://github.com/informationsea/geniatagger-python/blob/master/geniatagger.py#L19 .

It may also expect a leading './' in the path, so you would have to initialize the tagger as GeniaTagger('./geniatagger/geniatagger') .
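The path requirement above can be checked before constructing the tagger. The following is only a minimal sketch: check_tagger_path is a hypothetical helper, not part of geniatagger-python, and it assumes the wrapper launches the file you pass in from that file's own directory, as the linked source suggests.

import os


def check_tagger_path(path_to_tagger):
    """Return a list of reasons the tagger could not be launched from this path.

    Hypothetical helper for illustration; not part of geniatagger-python.
    """
    problems = []
    if os.path.isdir(path_to_tagger):
        # The wrapper wants the executable, not the folder that holds it.
        problems.append("path is a directory; pass the tagger executable itself")
    elif not os.path.isfile(path_to_tagger):
        # This is the situation that surfaces as "cannot find the file specified".
        problems.append("no such file: %s" % path_to_tagger)
    elif not os.access(path_to_tagger, os.X_OK):
        problems.append("file exists but is not executable")
    return problems

An empty list means the path at least points at an existing executable file; it does not verify that the model files sit next to it.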

Beyond the installation issues: if you use the Python wrapper for GeniaTagger, the GeniaTagger object has only one method, parse() . Its input is a single sentence string, and it returns a list of tuples for that sentence, one tuple per token. The items in each tuple are:

token (surface word)
lemma (see Stemmers vs Lemmatizers )
POS tag (looks like the Penn Treebank tagset, see What are all possible pos tags of NLTK? )
chunk tag, e.g. B-NP or B-VP (see Output results in conll format (POS-tagging, stanford pos tagger) )
named entity tag
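As a rough illustration of working with these five fields, the sketch below wraps the sample output shown in the TL;DR in a namedtuple. The names TaggedToken, word, lemma, pos, chunk and ne are labels chosen here for readability; the wrapper itself returns plain tuples.

from collections import namedtuple

# Illustrative field names for the five positions listed above (not defined by the wrapper).
TaggedToken = namedtuple("TaggedToken", ["word", "lemma", "pos", "chunk", "ne"])

# Sample result of tagger.parse('This is a pen.'), copied from the output above.
parsed = [('This', 'This', 'DT', 'B-NP', 'O'), ('is', 'be', 'VBZ', 'B-VP', 'O'),
          ('a', 'a', 'DT', 'B-NP', 'O'), ('pen', 'pen', 'NN', 'I-NP', 'O'),
          ('.', '.', '.', 'O', 'O')]

tokens = [TaggedToken(*fields) for fields in parsed]

# The first field of every tuple is the token, so the tokenization comes with the tags.
words = [t.word for t in tokens]
pos_tags = [(t.word, t.pos) for t in tokens]

Since the first field is the surface token, the result of parse() already contains the tokenization. Sentence splitting still has to happen beforehand, because parse() takes one sentence at a time.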

Answer 1
3 2015-08-18 17:39:26
