简体   繁体   中英

Get entities from NLTK.tree result

My original sentence is

Tsunami earthquakes have also been linked to the presence of a thin layer of subducted sedimentary rock along the uppermost part of the plate interface, as is thought to be present in areas of significant topography at the top of the oceanic crust, and where propagation was in an up-dip direction, possibly reaching the seafloor.

I passed the sentence to Stanford NLP and got the nice parse tree:

(ROOT (S (NP (NN Tsunami) (NNS earthquakes)) (VP (VBP have) (ADVP (RB also)) (VP (VBN been) (VP (VBN linked) (PP (TO to) (NP (NP (DT the) (NN presence)) (PP (IN of) (NP (NP (DT a) (JJ thin) (NN layer)) (PP (IN of) (S (VP (VBN subducted) (NP (NP (JJ sedimentary) (NN rock)) (PP (IN along) (NP (NP (NP (DT the) (JJS uppermost) (NN part)) (PP (IN of) (NP (DT the) (NN plate) (NN interface)))) (, ,) (UCP (RB as) (S (VP (VBZ is) (VP (VBN thought) (S (VP (TO to) (VP (VB be) (ADJP (JJ present) (PP (IN in) (NP (NP (NNS areas)) (PP (IN of) (NP (JJ significant) (NN topography)))))) (PP (IN at) (NP (NP (DT the) (NN top)) (PP (IN of) (NP (DT the) (JJ oceanic) (NN crust))))))))))) (, ,) (CC and) (SBAR (WHADVP (WRB where)) (S (NP (NN propagation)) (VP (VBD was) (PP (IN in) (NP (DT an) (JJ up-dip) (NN direction))) (, ,) (ADVP (RB possibly))))))))) (S (VP (VBG reaching) (NP (DT the) (NN seafloor)))))))))))))) (. .)))

Then I feed above string to NLTK.Tree:

pasrsd_tree = NLTK.Tree.fromstring(parsetree_string)

The result is quite nice:

Tree('ROOT', [Tree('S', [Tree('NP', [Tree('NN', ['Tsunami']), Tree('NNS', ['earthquakes'])]), Tree('VP', [Tree('VBP', ['have']), Tree('ADVP', [Tree('RB', ['also'])]), Tree('VP', [Tree('VBN', ['been']), Tree('VP', [Tree('VBN', ['linked']), Tree('PP', [Tree('TO', ['to']), Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('NN', ['presence'])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('NP', [Tree('DT', ['a']), Tree('JJ', ['thin']), Tree('NN', ['layer'])]), Tree('PP', [Tree('IN', ['of']), Tree('S', [Tree('VP', [Tree('VBN', ['subducted']), Tree('NP', [Tree('NP', [Tree('JJ', ['sedimentary']), Tree('NN', ['rock'])]), Tree('PP', [Tree('IN', ['along']), Tree('NP', [Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('JJS', ['uppermost']), Tree('NN', ['part'])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('DT', ['the']), Tree('NN', ['plate']), Tree('NN', ['interface'])])])]), Tree(',', [',']), Tree('UCP', [Tree('RB', ['as']), Tree('S', [Tree('VP', [Tree('VBZ', ['is']), Tree('VP', [Tr ee('VBN', ['thought']), Tree('S', [Tree('VP', [Tree('TO', ['to']), Tree('VP', [Tree('VB', ['be']), Tree('ADJP', [Tree('JJ', ['present']), Tree('PP', [Tree('IN', ['in']), Tree('NP', [Tree('NP', [Tree('NNS', ['areas'])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('JJ', ['significant']), Tree('NN', ['topography'])])])])])]), Tree('PP', [Tree('IN', ['at']), Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('NN', ['top'])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['oceanic']), Tree('NN', ['crust'])])])])])])])])])])]), Tree(',', [',']), Tree('CC', ['and']), Tree('SBAR', [Tree('WHADVP', [Tree('WRB', ['where'])]), Tree('S', [Tree('NP', [Tree('NN', ['propagation'])]), Tree('VP', [Tree('VBD', ['was']), Tree('PP', [Tree('IN', ['in']), Tree('NP', [Tree('DT', ['an']), Tree('JJ', ['up-dip']), Tree('NN', ['direction'])])]), Tree(',', [',']), Tree('ADVP', [Tree('RB', ['possibly'])])])])])])])])]), Tree('S', [Tree('VP', [Tree('VBG', ['reaching']), Tree('NP', [Tree('DT', ['the']), Tree('NN', ['seafloor'])])])])])])])])])])])])])]), Tree('.', ['.'])])])

My question is, given the pared_tree, how can I get the left level entity like top of the oceanic crust , a thin layer ?

I am thinking the levels of the parsed tree can be useful, but I really lost when looking at the tree level and don't how to do.

I am mainly Python based, Stanford NLP result is obtained using a Python wrapper( https://bitbucket.org/torotoki/corenlp-python ).

Could anyone help me and maybe point out some directions?

You can try extracting subtrees that are labelled NP :

>>> from nltk import Tree
>>> parsed_tree = Tree('ROOT', [Tree('S', [Tree('NP', [Tree('NN', ['Tsunami']), Tree('NNS', ['earthquakes'])]), Tree('VP', [Tree('VBP', ['have']), Tree('ADVP', [Tree('RB', ['also'])]), Tree('VP', [Tree('VBN', ['been']), Tree('VP', [Tree('VBN', ['linked']), Tree('PP', [Tree('TO', ['to']), Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('NN', ['presence'])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('NP', [Tree('DT', ['a']), Tree('JJ', ['thin']), Tree('NN', ['layer'])]), Tree('PP', [Tree('IN', ['of']), Tree('S', [Tree('VP', [Tree('VBN', ['subducted']), Tree('NP', [Tree('NP', [Tree('JJ', ['sedimentary']), Tree('NN', ['rock'])]), Tree('PP', [Tree('IN', ['along']), Tree('NP', [Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('JJS', ['uppermost']), Tree('NN', ['part'])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('DT', ['the']), Tree('NN', ['plate']), Tree('NN', ['interface'])])])]), Tree(',', [',']), Tree('UCP', [Tree('RB', ['as']), Tree('S', [Tree('VP', [Tree('VBZ', ['is']), Tree('VP', [Tree('VBN', ['thought']), Tree('S', [Tree('VP', [Tree('TO', ['to']), Tree('VP', [Tree('VB', ['be']), Tree('ADJP', [Tree('JJ', ['present']), Tree('PP', [Tree('IN', ['in']), Tree('NP', [Tree('NP', [Tree('NNS', ['areas'])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('JJ', ['significant']), Tree('NN', ['topography'])])])])])]), Tree('PP', [Tree('IN', ['at']), Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('NN', ['top'])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['oceanic']), Tree('NN', ['crust'])])])])])])])])])])]), Tree(',', [',']), Tree('CC', ['and']), Tree('SBAR', [Tree('WHADVP', [Tree('WRB', ['where'])]), Tree('S', [Tree('NP', [Tree('NN', ['propagation'])]), Tree('VP', [Tree('VBD', ['was']), Tree('PP', [Tree('IN', ['in']), Tree('NP', [Tree('DT', ['an']), Tree('JJ', ['up-dip']), Tree('NN', ['direction'])])]), Tree(',', [',']), Tree('ADVP', [Tree('RB', ['possibly'])])])])])])])])]), Tree('S', [Tree('VP', [Tree('VBG', ['reaching']), Tree('NP', [Tree('DT', ['the']), Tree('NN', ['seafloor'])])])])])])])])])])])])])]), Tree('.', ['.'])])])

>>> np = [" ".join(i.leaves()) for i in parsed_tree.subtrees() if i.label() == 'NP']
>>> np
['Tsunami earthquakes', 'the presence of a thin layer of subducted sedimentary rock along the uppermost part of the plate interface , as is thought to be present in areas of significant topography at the top of the oceanic crust , and where propagation was in an up-dip direction , possibly reaching the seafloor', 'the presence', 'a thin layer of subducted sedimentary rock along the uppermost part of the plate interface , as is thought to be present in areas of significant topography at the top of the oceanic crust , and where propagation was in an up-dip direction , possibly reaching the seafloor', 'a thin layer', 'sedimentary rock along the uppermost part of the plate interface , as is thought to be present in areas of significant topography at the top of the oceanic crust , and where propagation was in an up-dip direction , possibly', 'sedimentary rock', 'the uppermost part of the plate interface , as is thought to be present in areas of significant topography at the top of the oceanic crust , and where propagation was in an up-dip direction , possibly', 'the uppermost part of the plate interface', 'the uppermost part', 'the plate interface', 'areas of significant topography', 'areas', 'significant topography', 'the top of the oceanic crust', 'the top', 'the oceanic crust', 'propagation', 'an up-dip direction', 'the seafloor']

But that results in lots of noise, so let's say no single word is a phrase:

>>> np_mwe
['Tsunami earthquakes', 'the presence of a thin layer of subducted sedimentary rock along the uppermost part of the plate interface , as is thought to be present in areas of significant topography at the top of the oceanic crust , and where propagation was in an up-dip direction , possibly reaching the seafloor', 'the presence', 'a thin layer of subducted sedimentary rock along the uppermost part of the plate interface , as is thought to be present in areas of significant topography at the top of the oceanic crust , and where propagation was in an up-dip direction , possibly reaching the seafloor', 'a thin layer', 'sedimentary rock along the uppermost part of the plate interface , as is thought to be present in areas of significant topography at the top of the oceanic crust , and where propagation was in an up-dip direction , possibly', 'sedimentary rock', 'the uppermost part of the plate interface , as is thought to be present in areas of significant topography at the top of the oceanic crust , and where propagation was in an up-dip direction , possibly', 'the uppermost part of the plate interface', 'the uppermost part', 'the plate interface', 'areas of significant topography', 'significant topography', 'the top of the oceanic crust', 'the top', 'the oceanic crust', 'an up-dip direction', 'the seafloor']

Still quite noisy, let's say a noun phrase should not contain comma (not necessary true but useful trick):

>>> np_mwe_nocomma = [j for j in [" ".join(i.leaves()) for i in parsed_tree.subtrees() if i.label() == 'NP'] if j.count(' ') > 0 and j.count(',') == 0]
>>> np_mwe_nocomma
['Tsunami earthquakes', 'the presence', 'a thin layer', 'sedimentary rock', 'the uppermost part of the plate interface', 'the uppermost part', 'the plate interface', 'areas of significant topography', 'significant topography', 'the top of the oceanic crust', 'the top', 'the oceanic crust', 'an up-dip direction', 'the seafloor']

Now we easily see subtrees in subtrees, so let's choose to take the bigger subtree:

>> x = []
>>> for i in sorted(np_mwe_nocomma, key=len, reverse=True):
...     for j in x:
...             if i in j:
...                     continue
...     print i
...     x.append(i)
... 
the uppermost part of the plate interface
areas of significant topography
the top of the oceanic crust
significant topography
Tsunami earthquakes
the plate interface
an up-dip direction
the uppermost part
the oceanic crust
sedimentary rock
the presence
a thin layer
the seafloor

I'm not sure whether this gives you what you need but your definition of "entities" needs to be more specific otherwise almost any NP tagged by the parser can be an "entity"

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM