
How to extract the primary subject and object phrases from a complex sentence?

In the documentation for the Stanford Parser, the following example sentence is given:

The strongest rain ever recorded in India shut down the financial hub of Mumbai, snapped communication lines, closed airports and forced thousands of people to sleep in their offices or walk home during the night, officials said today.

This produces the parse tree:

[ROOT [S [S [NP [NP [DT The] [JJS strongest] [NN rain] ] [VP [ADVP [RB ever] ] [VBN recorded][PP [IN in] [NP [NNP India] ] ] ] ] [VP [VP [VBD shut] [PRT [RP down] ] [NP [NP [DT the] [JJ financial] [NN hub] ] [PP [IN of] [NP [NNP Mumbai] ] ] ] ] [, ,] [VP [VBD snapped] [NP [NN communication] [NNS lines] ] ] [, ,] [VP [VBD closed] [NP [NNS airports] ] ] [CC and] [VP [VBD forced] [NP [NP [NNS thousands] ] [PP [IN of] [NP [NNS people] ] ] ] [S [VP [TO to] [VP [VP [VB sleep] [PP [IN in] [NP [PRP$ their] [NNS offices] ] ] ] [CC or] [VP [VB walk] [NP [NN home] ] [PP [IN during] [NP [DT the] [NN night] ] ] ] ] ] ] ] ] ] [, ,] [NP [NNS officials] ] [VP [VBD said] [NP-TMP [NN today] ] ] [. .] ] ]

(see http://i.imgur.com/mZLBDmh.png ).

What sort of NLP tool would be able to output the sentential subject and object from the above complex sentence example? Desired output:

sentence_subj_phrase = "the strongest rain ever recorded in India"
sentence_obj_phrase = "the financial hub of Mumbai"

From the original poster's post (details about what they think doesn't work):

A naive way of extracting the subject and object in a sentence is to find the noun phrases immediately preceding and succeeding the verb. In complex sentences, however, there are multiple verbs, and thus multiple subjects and objects. It is possible to consider complex sentences like this as multiple sentences (using the first part of the independent clause as the "root", and replacing the second part with each of the dependent clauses), but usually the first clause is the most important and could be considered the main "topic" of the sentence.

Doing a simple BFS to find the first NP prior to a verb will result in "officials" being the subject, since it is at the lowest depth level. This doesn't capture the intuition that the first clause contains the subject. One approach I tried was searching for the NPs in the first "base" S node (i.e., the lowest-level subtree rooted at an S node), but in this case that would capture nodes rooted at S_3.
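To make the BFS problem concrete, here is a minimal sketch in plain Python. The tree is an abbreviated copy of the parse above (the coordinated VPs after "shut down ... Mumbai" are dropped), and `parse`, `words` and `first_np_bfs` are hypothetical helper names, not part of any parser's API. It also simplifies "first NP prior to a verb" to "first NP reached by BFS".

```python
from collections import deque

# Abbreviated version of the Stanford parse shown above (assumption: trimmed for brevity).
TREE = (
    "[ROOT [S "
    "[S [NP [NP [DT The] [JJS strongest] [NN rain]] "
    "[VP [ADVP [RB ever]] [VBN recorded] [PP [IN in] [NP [NNP India]]]]] "
    "[VP [VBD shut] [PRT [RP down]] "
    "[NP [NP [DT the] [JJ financial] [NN hub]] [PP [IN of] [NP [NNP Mumbai]]]]]] "
    "[, ,] [NP [NNS officials]] [VP [VBD said] [NP-TMP [NN today]]] [. .]]]"
)

def parse(text):
    """Turn '[LABEL child ...]' into (label, [children]); leaves are plain strings."""
    tokens = text.replace("[", " [ ").replace("]", " ] ").split()
    stack = [("TOP", [])]
    i = 0
    while i < len(tokens):
        if tokens[i] == "[":            # open a node; the next token is its label
            stack.append((tokens[i + 1], []))
            i += 2
        elif tokens[i] == "]":          # close the node and attach it to its parent
            node = stack.pop()
            stack[-1][1].append(node)
            i += 1
        else:                           # a word
            stack[-1][1].append(tokens[i])
            i += 1
    return stack[0][1][0]

def words(node):
    """All words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [w for child in node[1] for w in words(child)]

def first_np_bfs(tree):
    """Return the text of the shallowest NP, visiting nodes breadth-first."""
    queue = deque([tree])
    while queue:
        node = queue.popleft()
        if node[0] == "NP":
            return " ".join(words(node))
        queue.extend(child for child in node[1] if not isinstance(child, str))
    return None

print(first_np_bfs(parse(TREE)))  # officials
```

The NP for "officials" sits one level higher than the NP for "The strongest rain ...", so BFS reaches it first.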

You seem to be mixing up the notions of topic and grammatical subject to some extent. "officials" is a perfectly good grammatical subject of "said". As you sort of explain, you should think about finding subjects of clauses ("S" subtrees in the tree) rather than subjects of sentences. "the strongest rain..." is the grammatical subject of S_2 in your example.

If all you want is the first grammatical subject in any clause in the sentence, find all subjects in all S subtrees using whatever algorithm you've chosen (the NP in an S->NP VP subtree, etc.) and then pick the one that's furthest to the left in the whole tree. (This obviously won't necessarily find a phrase that's a good topic, though.)
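A rough sketch of that idea in plain Python, assuming the simplest rule (an NP immediately followed by a VP directly under an S). The tree is an abbreviated copy of the parse from the question, and `parse`, `leaves` and `clause_subjects` are hypothetical names made up for this illustration.

```python
# Abbreviated version of the parse from the question (assumption: trimmed for brevity).
TREE = (
    "[ROOT [S "
    "[S [NP [NP [DT The] [JJS strongest] [NN rain]] "
    "[VP [ADVP [RB ever]] [VBN recorded] [PP [IN in] [NP [NNP India]]]]] "
    "[VP [VBD shut] [PRT [RP down]] "
    "[NP [NP [DT the] [JJ financial] [NN hub]] [PP [IN of] [NP [NNP Mumbai]]]]]] "
    "[, ,] [NP [NNS officials]] [VP [VBD said] [NP-TMP [NN today]]] [. .]]]"
)

def parse(text):
    """Turn '[LABEL child ...]' into (label, [children]); leaves are plain strings."""
    tokens = text.replace("[", " [ ").replace("]", " ] ").split()
    stack = [("TOP", [])]
    i = 0
    while i < len(tokens):
        if tokens[i] == "[":
            stack.append((tokens[i + 1], []))
            i += 2
        elif tokens[i] == "]":
            node = stack.pop()
            stack[-1][1].append(node)
            i += 1
        else:
            stack[-1][1].append(tokens[i])
            i += 1
    return stack[0][1][0]

def leaves(node):
    """All words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [w for child in node[1] for w in leaves(child)]

def clause_subjects(node, offset=0):
    """Yield (index of first word, phrase) for every NP in an S -> NP VP pattern."""
    label, children = node
    pos = offset
    for i, child in enumerate(children):
        if isinstance(child, str):
            pos += 1
            continue
        nxt = children[i + 1] if i + 1 < len(children) else None
        next_is_vp = nxt is not None and not isinstance(nxt, str) and nxt[0] == "VP"
        if label == "S" and child[0] == "NP" and next_is_vp:
            yield pos, " ".join(leaves(child))
        yield from clause_subjects(child, pos)   # look for embedded clauses too
        pos += len(leaves(child))

subjects = list(clause_subjects(parse(TREE)))
print(min(subjects)[1])  # The strongest rain ever recorded in India
```

Sorting by the index of the first word is what "furthest to the left" amounts to here; "officials" is still found as a subject, it just starts later in the sentence.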

A few points to note: when you talk about grammatical subjects and objects, you are following the structuralist theory of linguistics, which most NLP tasks adhere to.

Next, when you talk about a grammatical subject or object, you should refer only to the entity (i.e., the thing/event) itself, excluding the entity's modifiers. Take "the strongest rain ever recorded in India":

entity = "rain"
entity modifiers = [("Adjective/Preposition_Phrase", "ever recorded in India"), ("Determiners", "the"), ("Adjective_Phrase", "strongest")]
entity phrase = "the strongest rain"
entity phrase with all possible modifiers (EP_mod) = "the strongest rain ever recorded in India"

That brings us to the NLP task of detecting EP_mod:

  1. First, figure out an algorithm that determines the primary predicate (i.e., the verb, in shallow computational grammar) of the complex sentence. (I suggest finding the verb at the topmost level of the parse tree.)

  2. Then, find the phrase that contains the SUBJ/OBJ entity of the primary predicate. (Any normal NLP parser should tell you this.)

  3. Lastly, find the modifiers of the phrase that contains the SUBJ/OBJ entity of the primary predicate. (You will probably need a dependency parser for this, one that gives you an annotation saying SUBJ_phrase governs Modifier_phrase. The Stanford parser can output dependencies as well as constituency trees.)
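The three steps above can be sketched on a dependency structure roughly like this. Note the assumptions: the arcs below are hand-annotated for illustration (not real parser output), they cover only the first clause of the example sentence, and `subtree`, `phrase` and `argument` are hypothetical helper names.

```python
# (word, index of head, dependency label); hand-annotated, NOT parser output.
TOKENS = [
    ("The", 2, "det"), ("strongest", 2, "amod"), ("rain", 7, "nsubj"),
    ("ever", 4, "advmod"), ("recorded", 2, "acl"), ("in", 4, "prep"),
    ("India", 5, "pobj"), ("shut", -1, "ROOT"), ("down", 7, "prt"),
    ("the", 11, "det"), ("financial", 11, "amod"), ("hub", 7, "dobj"),
    ("of", 11, "prep"), ("Mumbai", 12, "pobj"),
]

def subtree(index):
    """Indices of a token plus everything it transitively governs, in sentence order."""
    result = [index]
    for i, (_, head, _) in enumerate(TOKENS):
        if head == index:
            result.extend(subtree(i))
    return sorted(result)

def phrase(index):
    """The entity phrase with all its modifiers (EP_mod) for the token at `index`."""
    return " ".join(TOKENS[i][0] for i in subtree(index))

def argument(predicate, labels):
    """Index of the first direct dependent of `predicate` with a label in `labels`."""
    for i, (_, head, dep) in enumerate(TOKENS):
        if head == predicate and dep in labels:
            return i
    return None

# Step 1: the primary predicate is the root of the dependency tree.
root = next(i for i, (_, _, dep) in enumerate(TOKENS) if dep == "ROOT")
# Step 2: the SUBJ/OBJ entities are direct dependents of that predicate.
subj = argument(root, {"nsubj", "nsubjpass"})
obj = argument(root, {"dobj"})
# Step 3: collect the modifiers each entity governs.
print(phrase(subj))  # The strongest rain ever recorded in India
print(phrase(obj))   # the financial hub of Mumbai
```

On the full sentence a parser would most likely make "said" the root, so step 1 would need an extra rule (for example, descending into the clausal complement) to reach "shut".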

What you're asking for is a mish-mash of currently existing tools, so the best solution is to build it yourself (the eat-your-own-dog-food approach). Have fun with it =)

Here is a Python spaCy method:

Code

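import spacy  # older spaCy releases used `from spacy.en import English`
nlp = spacy.load("en_core_web_sm")  # assumes this English model is installed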


SUBJECTS = ["nsubj", "nsubjpass"]  # add or remove labels as needed
OBJECTS = ["dobj", "pobj"]  # add or remove labels as needed


sent = "The strongest rain ever recorded in India shut down the financial hub of Mumbai, snapped communication lines, closed airports and forced thousands of people to sleep in their offices or walk home during the night, officials said today."

doc = nlp(sent)
# Keep the tokens whose dependency label marks them as a subject or an object.
sub_toks = [tok for tok in doc if tok.dep_ in SUBJECTS]
obj_toks = [tok for tok in doc if tok.dep_ in OBJECTS]

print("Subjects:", sub_toks)
print("Objects :", obj_toks)

Result

Subjects: [rain, officials]
Objects : [India, hub, Mumbai, lines, thousands, people, offices, night]
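The lists above contain only head tokens, while the question asks for whole phrases. One possible way to get them, sketched here as a hypothetical helper (`subtree_phrase` is not a spaCy function), is to join each token's dependency subtree. It relies only on spaCy's `Token.subtree` and `Token.text_with_ws` attributes.

```python
def subtree_phrase(token):
    """Join a token with everything it governs, in sentence order."""
    return "".join(t.text_with_ws for t in token.subtree).strip()

# Intended usage with the variables from the code above:
#   subject_phrases = [subtree_phrase(tok) for tok in sub_toks]
#   object_phrases = [subtree_phrase(tok) for tok in obj_toks]
```

The exact phrases depend on the parse the model produces. Since "pobj" tokens such as "India" sit inside larger phrases, you may want to restrict OBJECTS to "dobj" when expanding to phrases.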

