Stanford Parser for Python: Output Format
I am currently using the Python interface for the Stanford Parser.
from nltk.parse.stanford import StanfordParser
import os
os.environ['STANFORD_PARSER'] = '/Users/au571533/Downloads/stanford-parser-full-2016-10-31'
os.environ['STANFORD_MODELS'] = '/Users/au571533/Downloads/stanford-parser-full-2016-10-31'
parser = StanfordParser(model_path="edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz")
new = list(parser.raw_parse("The young man who boarded his usual train that Sunday afternoon was twenty-four years old and fat. "))
print(new)
The output I get looks something like this:
[Tree('ROOT', [Tree('S', [Tree('NP', [Tree('NP', [Tree('DT', ['The']), Tree('JJ', ['young']), Tree('NN', ['man'])]), Tree('SBAR', [Tree('WHNP', [Tree('WP', ['who'])]), Tree('S', [Tree('VP', [Tree('VBD', ['boarded']), Tree('NP', [Tree('PRP$', ['his']), Tree('JJ', ['usual']), Tree('NN', ['train'])]), Tree('NP', [Tree('DT', ['that']), Tree('NNP', ['Sunday'])])])])])]), Tree('NP', [Tree('NN', ['afternoon'])]), Tree('VP', [Tree('VBD', ['was']), Tree('NP', [Tree('NP', [Tree('JJ', ['twenty-four']), Tree('NNS', ['years'])]), Tree('ADJP', [Tree('JJ', ['old']), Tree('CC', ['and']), Tree('JJ', ['fat'])])])]), Tree('.', ['.'])])])]
However, I only need the part-of-speech labels, so I'd like the output in a word/tag format.
In Java it is possible to specify -outputFormat 'wordsAndTags', and it gives exactly what I want. Any hint on how to implement this in Python?
Help would be GREATLY appreciated. Thanks!
PS: I tried the Stanford POSTagger, but it is far less accurate on some of the words I'm interested in.
If you look at the NLTK classes for the Stanford parser, you can see that the raw_parse_sents() method doesn't send the -outputFormat wordsAndTags option that you want, and instead sends -outputFormat Penn. If you derive your own class from StanfordParser, you could override this method and specify the wordsAndTags format.
from nltk.parse.stanford import StanfordParser

class MyParser(StanfordParser):

    def raw_parse_sents(self, sentences, verbose=False):
        """
        Use StanfordParser to parse multiple sentences. Takes multiple sentences as a
        list of strings.
        Each sentence will be automatically tokenized and tagged by the Stanford Parser.
        The output format is `wordsAndTags`.

        :param sentences: Input sentences to parse
        :type sentences: list(str)
        :rtype: iter(iter(Tree))
        """
        cmd = [
            self._MAIN_CLASS,
            '-model', self.model_path,
            '-sentences', 'newline',
            '-outputFormat', 'wordsAndTags',
        ]
        return self._parse_trees_output(self._execute(cmd, '\n'.join(sentences), verbose))
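
Once the parser emits wordsAndTags text, each sentence still has to be turned into (word, tag) pairs on the Python side. Below is a minimal sketch of that step. It assumes each sentence arrives as one line of whitespace-separated word/tag tokens; the helper name split_words_and_tags and the sample line are hypothetical and not part of NLTK or the Stanford Parser.

```python
def split_words_and_tags(line):
    """Split one line of word/tag tokens into (word, tag) pairs.

    Assumes whitespace-separated tokens of the form word/tag.
    This helper is illustrative only; it is not an NLTK function.
    """
    pairs = []
    for token in line.split():
        # Split on the LAST slash so words that contain '/' stay intact.
        word, sep, tag = token.rpartition('/')
        if not sep:
            # No slash at all: keep the token as the word, with no tag.
            word, tag = token, None
        pairs.append((word, tag))
    return pairs


# Hypothetical sample line in the assumed format.
sample = "The/DT young/JJ man/NN was/VBD old/JJ ./."
pairs = split_words_and_tags(sample)
print(pairs)
```

If changing the output format turns out to be more trouble than it is worth, the Tree objects already returned in the question also expose a pos() method that yields (word, tag) pairs from the leaves, which may be enough on its own.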