Segmenting sentence into subsentences with CoreNLP

I am working on the following problem: I would like to split sentences into subsentences using Stanford CoreNLP. The example sentence could be:

"Richard is working with CoreNLP, but does not really understand what he is doing"

I would now like the sentence to be split into its individual "S" constituents, as shown in the tree diagram below:

[Image: tree diagram of the example sentence]

I would like the output to be a list with one entry per "S" constituent, as follows:

['Richard is working with CoreNLP', ', but', 'does not really understand what', 'he is doing']

I would be really thankful for any help :)

I suspect the tool you're looking for is Tregex, described in more detail in the linked PowerPoint slides or in the Javadoc of the class itself.

In your case, I believe the pattern you're looking for is simply S. So, something like:

tregex.sh "S" <path_to_file>

where the file is a Penn Treebank formatted tree -- that is, something like (ROOT (S (NP (NNS dogs)) (VP (VB chase) (NP (NNS cats))))).
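To make the bracketed format concrete, here is a minimal Python sketch that reads such a tree and lists the words under every node labelled S. It is not Tregex, only a hand-rolled stand-in for the plain pattern S; the helper names parse_tree, words and find_label are made up for this illustration.

```python
def parse_tree(text):
    """Parse a Penn Treebank bracketed string into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token != "(":
            return token              # a plain word (leaf of the tree)
        label = tokens[pos]           # first token after "(" is the node label
        pos += 1
        children = []
        while tokens[pos] != ")":
            children.append(read())
        pos += 1                      # consume the closing ")"
        return (label, children)

    return read()


def words(node):
    """Return the words covered by a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [w for child in node[1] for w in words(child)]


def find_label(node, label):
    """Yield every subtree with the given label, outermost first."""
    if isinstance(node, str):
        return
    if node[0] == label:
        yield node
    for child in node[1]:
        yield from find_label(child, label)


# The example tree from the answer above.
tree = parse_tree("(ROOT (S (NP (NNS dogs)) (VP (VB chase) (NP (NNS cats)))))")
for match in find_label(tree, "S"):
    print(" ".join(words(match)))     # prints "dogs chase cats"
```

This tree contains a single S node, so there is exactly one match.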

As an aside: I believe the fragment ", but" is not actually a sentence, as you've highlighted in the figure. Rather, the node you've highlighted subsumes the whole sentence "Richard is working with CoreNLP, but does not really understand what he is doing". Tregex would then print out this whole sentence as one of the matches. Similarly, "does not really understand what" is not a sentence unless it subsumes the entire SBAR: "does not really understand what he is doing".

If you want just the "leaf" sentences (i.e., a sentence that does not itself contain another sentence), you can try a pattern more like:

S !<< S
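The difference between matching every S and keeping only the "leaf" ones can be sketched in plain Python. This is an illustration of the idea rather than Tregex itself; the tree below is hand-written for the example (not real parser output), and parse_tree, words, find_label and has_descendant are hypothetical helper names.

```python
def parse_tree(text):
    """Parse a Penn Treebank bracketed string into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token != "(":
            return token                      # a plain word
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            children.append(read())
        pos += 1                              # consume the closing ")"
        return (label, children)

    return read()


def words(node):
    """Return the words covered by a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [w for child in node[1] for w in words(child)]


def find_label(node, label):
    """Yield every subtree with the given label, outermost first."""
    if isinstance(node, str):
        return
    if node[0] == label:
        yield node
    for child in node[1]:
        yield from find_label(child, label)


def has_descendant(node, label):
    """True if some node strictly below `node` carries the given label."""
    if isinstance(node, str):
        return False
    return any(
        not isinstance(child, str)
        and (child[0] == label or has_descendant(child, label))
        for child in node[1]
    )


# Hand-written tree with two S nodes nested inside an outer S.
tree = parse_tree(
    "(ROOT (S (S (NP (NNS dogs)) (VP (VBP bark))) (CC and) "
    "(S (NP (NNS cats)) (VP (VBP hide)))))"
)

all_s = [" ".join(words(n)) for n in find_label(tree, "S")]
leaf_s = [" ".join(words(n)) for n in find_label(tree, "S")
          if not has_descendant(n, "S")]

print(all_s)   # ['dogs bark and cats hide', 'dogs bark', 'cats hide']
print(leaf_s)  # ['dogs bark', 'cats hide']
```

Matching every S returns the outer sentence as well as its parts, while the filtered list keeps only the S nodes that have no S beneath them.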

Note: I haven't tested the patterns -- use at your own risk!

OK, I found that one can do this as follows:

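import requests

# Assumes a CoreNLP server is already running locally on port 9000.
url = "http://localhost:9000/tregex"
request_params = {"pattern": "S"}  # Tregex pattern: match every S node
text = "Pusheen and Smitha walked along the beach."
# The raw text goes in the POST body; the pattern is passed as a URL parameter.
r = requests.post(url, data=text, params=request_params)
print(r.json())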
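To turn the JSON into plain strings, something like the sketch below may help. It assumes that the response stores each matched subtree as a bracketed string under a "match" key; the exact layout may differ between CoreNLP versions, so inspect r.json() first. The sample payload is hand-written for illustration, not real server output, and collect_matches and leaf_words are hypothetical helper names.

```python
import re


def collect_matches(payload):
    """Recursively collect every string stored under a "match" key (assumed key name)."""
    found = []
    if isinstance(payload, dict):
        for key, value in payload.items():
            if key == "match" and isinstance(value, str):
                found.append(value.strip())
            else:
                found.extend(collect_matches(value))
    elif isinstance(payload, list):
        for item in payload:
            found.extend(collect_matches(item))
    return found


def leaf_words(tree_string):
    """Extract the words of a bracketed tree: each word directly precedes a ")"."""
    return re.findall(r"\s([^\s()]+)\)", tree_string)


# Hand-written stand-in for r.json(); the real structure is an assumption.
sample = {
    "sentences": [
        {"0": {"match": "(S (NP (NNS dogs)) (VP (VBP bark)))\n", "namedNodes": []}}
    ]
}

for tree_string in collect_matches(sample):
    print(" ".join(leaf_words(tree_string)))   # prints "dogs bark"
```

Walking the payload recursively, instead of indexing into fixed positions, keeps the sketch working even if the nesting around the "match" entries is slightly different from what is assumed here.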

Does anybody know how to use this with other languages (I need German)?
