
Using The Stanford Parser in Python on Chinese text not working

I tried this code, but it didn't work:

# -*- coding:utf-8 -*-
from nltk.parse import stanford
s = '你好'.decode('utf-8')

print s
parser = stanford.StanfordParser(path_to_jar='stanford-parser.jar',path_to_models_jar='stanford-parser-3.5.1-models.jar')
print parser.raw_parse_sents(s)

The result prints 你 and 好 as two words:

你好
[Tree('ROOT', [Tree('NP', [Tree('NNP', ['\u4f60'])])]), Tree('ROOT', [Tree('NP', [Tree('NNP', ['\u597d'])])])]

but on the online parser (http://nlp.stanford.edu:8080/parser/index.jsp), the result is:

(ROOT (IP (VP (VV 你好))))

How do I fix my code to produce the same result as the online parser?

There are two (ok, three... see "Update 3" below for the third) separate things going on:

1) Your code is returning two trees (two ROOTs), but you only expect to get one. This is happening because raw_parse_sents expects a list of sentences, not a single sentence; if you give it a string, it parses each character as if it were its own sentence and returns a list of one-character trees. So either pass raw_parse_sents a list, or use raw_parse instead.
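For example, here is a minimal sketch of both fixes to the call itself, reusing the s and parser objects from your code:

# wrap the string in a list so raw_parse_sents sees one sentence,
# not one "sentence" per character
print parser.raw_parse_sents([s])

# or parse the single sentence directly (cast to list to print it)
print list(parser.raw_parse(s))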

2) You haven't specified a model_path, and the default is English. There are five options for Chinese, but it looks like this one matches the online parser:

parser = stanford.StanfordParser(model_path='edu/stanford/nlp/models/lexparser/xinhuaFactored.ser.gz', path_to_jar='stanford-parser.jar',path_to_models_jar='stanford-parser-3.5.1-models.jar')
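(For reference, the five Chinese models listed in the Stanford Parser FAQ are chinesePCFG.ser.gz, chineseFactored.ser.gz, xinhuaPCFG.ser.gz, xinhuaFactored.ser.gz, and xinhuaFactoredSegmenting.ser.gz.)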

Combining these two changes, I am able to match the online parser (I also had to cast the returned listiterator to a list in order to match your output format):

from nltk.parse import stanford
s = '你好'.decode('utf-8')

print s.encode('utf-8')
parser = stanford.StanfordParser(model_path='edu/stanford/nlp/models/lexparser/xinhuaFactored.ser.gz', path_to_jar='stanford-parser.jar', path_to_models_jar='stanford-parser-3.5.1-models.jar')
print list(parser.raw_parse(s))

你好
[Tree('ROOT', [Tree('IP', [Tree('VP', [Tree('VV', ['\u4f60\u597d'])])])])]

Update 1:

I realized you might also be looking for output formatted more like the website's, in which case this works:

for tree in parser.raw_parse(s):
    print tree # or print tree.pformat().encode('utf-8') to force an encoding

Update 2:

Apparently, if your version of NLTK is earlier than 3.0.2, Tree.pformat() was Tree.pprint() (a small version-agnostic sketch follows the list below). From https://github.com/nltk/nltk/wiki/Porting-your-code-to-NLTK-3.0:

Printing changes (from 3.0.2, see https://github.com/nltk/nltk/issues/804):

  • classify.decisiontree.DecisionTreeClassifier.pp → pretty_format
  • metrics.confusionmatrix.ConfusionMatrix.pp → pretty_format
  • sem.lfg.FStructure.pprint → pretty_format
  • sem.drt.DrtExpression.pretty → pretty_format
  • parse.chart.Chart.pp → pretty_format
  • Tree.pprint() → pformat
  • FreqDist.pprint → pformat
  • Tree.pretty_print → pprint
  • Tree.pprint_latex_qtree → pformat_latex_qtree
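If your code needs to run on both sides of that rename, here is a minimal compatibility sketch (assuming only the method name changed, not the string it returns):

# NLTK >= 3.0.2 renamed Tree.pprint() (which returned a string)
# to Tree.pformat(); fall back to the old name on older versions
def tree_to_string(tree):
    if hasattr(tree, 'pformat'):
        return tree.pformat()
    return tree.pprint()

for tree in parser.raw_parse(s):
    print tree_to_string(tree).encode('utf-8')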

Update 3:

I am now trying to match the output for the sentence in your comment, '你好,我心情不错今天,你呢?'.

I referred to the Stanford Parser FAQ extensively while writing this response and suggest you check it out (especially "Can you give me some help in getting started parsing Chinese?"). Here's what I've learned:

In general, you need to "segment" Chinese text into words (consisting of one or more characters) separated by spaces before parsing it. The online parser does this, and you can see the output of both the segmentation step and the parsing step on the web page. For our test sentence, the segmentation it shows is '你好 , 我 心情 不错 今天 , 你 呢 ?'.

If I run this segmented string through the xinhuaFactored model locally, my output matches the online parser exactly.
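Concretely, that local run looks like this (the segmented sentence is copied from the online parser's segmentation output):

segmented = '你好 , 我 心情 不错 今天 , 你 呢 ?'.decode('utf-8')
for tree in parser.raw_parse(segmented):
    print tree.pformat().encode('utf-8')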

So we need to run our text through a word segmenter before running it through the parser. The FAQ recommends the Stanford Word Segmenter, which is probably what the online parser is using anyway: http://nlp.stanford.edu/software/segmenter.shtml
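NLTK ships a wrapper for the segmenter as well. Here is a minimal sketch; note that the jar and data-file paths below (stanford-segmenter.jar, the ./data directory, the ctb.gz Chinese Treebank model, and dict-chris6.ser.gz) are assumptions based on the segmenter's standard download layout, so adjust them to your installation:

# -*- coding:utf-8 -*-
from nltk.tokenize.stanford_segmenter import StanfordSegmenter

# paths below are assumptions from the stock segmenter distribution
segmenter = StanfordSegmenter(
    path_to_jar='stanford-segmenter.jar',
    path_to_sihan_corpora_dict='./data',
    path_to_model='./data/ctb.gz',
    path_to_dict='./data/dict-chris6.ser.gz')

segmented = segmenter.segment('你好,我心情不错今天,你呢?'.decode('utf-8'))
print segmented.encode('utf-8')
# the space-separated result can then go to parser.raw_parse() as above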

(As the FAQ mentions, the parser also contains a model, xinhuaFactoredSegmenting, which does an approximate segmentation as part of the parsing call. However, the FAQ calls this method "reasonable, but not excellent", and its output doesn't match the online parser anyway, which is our standard.)
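For completeness, selecting that all-in-one model would look like this (a sketch; the .ser.gz path is an assumption following the naming pattern of the other models in the jar):

parser = stanford.StanfordParser(model_path='edu/stanford/nlp/models/lexparser/xinhuaFactoredSegmenting.ser.gz', path_to_jar='stanford-parser.jar', path_to_models_jar='stanford-parser-3.5.1-models.jar')
print list(parser.raw_parse('你好,我心情不错今天,你呢?'.decode('utf-8')))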
