
Segmenting a sentence into subsentences with CoreNLP

I am working on the following problem: I would like to split sentences into subsentences using Stanford CoreNLP. The example sentence could be:

"Richard is working with CoreNLP, but does not really understand what he is doing"

I would now like my sentence to be split into single "S" nodes, as shown in the tree diagram below:

[Image: tree diagram of the example sentence]

I would like the output to be a list with the single "S" nodes as follows:

['Richard is working with CoreNLP', ', but', 'does not really understand what', 'he is doing']

I would be really thankful for any help :)

I suspect the tool you're looking for is Tregex, described in more detail in the PowerPoint here or in the Javadoc of the class itself.

In your case, I believe the pattern you're looking for is simply S. So, something like:

tregex.sh "S" <path_to_file>

where the file contains a Penn Treebank formatted tree; that is, something like (ROOT (S (NP (NNS dogs)) (VP (VB chase) (NP (NNS cats))))).
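To make the tree format and the effect of the bare pattern S more concrete, here is a minimal sketch in plain Python. It is not Tregex and does not call CoreNLP; the helper names parse_tree, leaves and subtrees are made up for this illustration. It only reads one bracketed tree and prints the words under every node labelled S, which is roughly what I would expect the pattern S to select.

```python
import re

def parse_tree(text):
    """Parse one Penn Treebank bracketed tree into (label, children) tuples.

    Leaves are plain strings. Assumes every "(" is directly followed by a label.
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    return read()

def leaves(node):
    """Return the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1]:
        words.extend(leaves(child))
    return words

def subtrees(node, label):
    """Yield every subtree carrying the given label, outer nodes first."""
    if isinstance(node, str):
        return
    if node[0] == label:
        yield node
    for child in node[1]:
        yield from subtrees(child, label)

tree = parse_tree("(ROOT (S (NP (NNS dogs)) (VP (VB chase) (NP (NNS cats)))))")
for s in subtrees(tree, "S"):
    print(" ".join(leaves(s)))  # prints "dogs chase cats"
```

The tree has a single S node, so only one line is printed.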

As an aside: I believe the fragment ", but" is not actually a sentence, even though you've highlighted it as one in the figure. Rather, the node you've highlighted subsumes the whole sentence "Richard is working with CoreNLP, but does not really understand what he is doing". Tregex would then print out this whole sentence as one of the matches. Similarly, "does not really understand what" is not a sentence unless it subsumes the entire SBAR: "does not really understand what he is doing".

If you want just the "leaf" sentences (i.e., a sentence that's not subsumed by another sentence), you can try a pattern more like:

S !>> S

Note: I haven't tested the patterns, so use at your own risk!

OK, I found that one can do this as follows:

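import requests

# Assumes a CoreNLP server is already running locally on port 9000.
url = "http://localhost:9000/tregex"
# The Tregex pattern is passed as a query parameter.
request_params = {"pattern": "S"}
# The raw text to parse is sent as the POST body.
text = "Pusheen and Smitha walked along the beach."
r = requests.post(url, data=text, params=request_params)
print(r.json())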
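To get from the JSON reply to a plain list of matches, something along these lines might work. This is only a sketch: it assumes the reply has a "sentences" list in which each entry maps a match index ("0", "1", ...) to a record with a "match" field holding the bracketed subtree. Please check that against what your server version actually returns. The sample dict below is hand-written to follow that assumed shape, and tregex_matches is a made-up helper name.

```python
def tregex_matches(response_json):
    """Collect matched subtrees (bracketed strings) from a /tregex reply.

    Assumed shape: {"sentences": [{"0": {"match": "..."}, "1": {...}}, ...]}
    """
    matches = []
    for sentence in response_json.get("sentences", []):
        for record in sentence.values():
            # Skip anything that does not look like a match record.
            if isinstance(record, dict) and "match" in record:
                matches.append(record["match"].strip())
    return matches

# Hand-written sample in the assumed shape, not real server output.
sample = {
    "sentences": [
        {"0": {"match": "(S (NP (NNS dogs)) (VP (VB chase) (NP (NNS cats))))\n",
               "namedNodes": []}}
    ]
}
print(tregex_matches(sample))
```

With the request above, the call would be tregex_matches(r.json()).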

Does anybody know how to use other languages (I need German)?

