
Get entities from NLTK.tree result

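My original sentence is:

Tsunami earthquakes have also been linked to the presence of a thin layer of subducted sedimentary rock along the uppermost part of the plate interface, as is thought to be present in areas of significant topography at the top of the oceanic crust, and where propagation was in an up-dip direction, possibly reaching the seafloor.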

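I passed the sentence to Stanford NLP and got a reasonable parse tree:

(ROOT (S (NP (NN Tsunami) (NNS earthquakes)) (VP (VBP have) (ADVP (RB also)) (VP (VBN been) (VP (VBN linked) (PP (TO to) (NP (NP (DT the) (NN presence)) (PP (IN of) (NP (NP (DT a) (JJ thin) (NN layer)) (PP (IN of) (S (VP (VBN subducted) (NP (NP (JJ sedimentary) (NN rock)) (PP (IN along) (NP (NP (NP (DT the) (JJS uppermost) (NN part)) (PP (IN of) (NP (DT the) (NN plate) (NN interface)))) (, ,) (UCP (RB as) (S (VP (VBZ is) (VP (VBN thought) (S (VP (TO to) (VP (VB be) (ADJP (JJ present) (PP (IN in) (NP (NP (NNS areas)) (PP (IN of) (NP (JJ significant) (NN topography)))))) (PP (IN at) (NP (NP (DT the) (NN top)) (PP (IN of) (NP (DT the) (JJ oceanic) (NN crust))))))))))) (, ,) (CC and) (SBAR (WHADVP (WRB where)) (S (NP (NN propagation)) (VP (VBD was) (PP (IN in) (NP (DT an) (JJ up-dip) (NN direction))) (, ,) (ADVP (RB possibly))))))))) (S (VP (VBG reaching) (NP (DT the) (NN seafloor)))))))))))))) (. .)))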

Then I fed the string above to nltk.Tree:

parsed_tree = nltk.Tree.fromstring(parsetree_string)

The result looks fine:

Tree('ROOT', [Tree('S', [Tree('NP', [Tree('NN', ['Tsunami']), Tree('NNS', ['earthquakes'])]), Tree('VP', [Tree('VBP', ['have']), Tree('ADVP', [Tree('RB', ['also'])]), Tree('VP', [Tree('VBN', ['been']), Tree('VP', [Tree('VBN', ['linked']), Tree('PP', [Tree('TO', ['to']), Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('NN', ['presence'])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('NP', [Tree('DT', ['a']), Tree('JJ', ['thin']), Tree('NN', ['layer'])]), Tree('PP', [Tree('IN', ['of']), Tree('S', [Tree('VP', [Tree('VBN', ['subducted']), Tree('NP', [Tree('NP', [Tree('JJ', ['sedimentary']), Tree('NN', ['rock'])]), Tree('PP', [Tree('IN', ['along']), Tree('NP', [Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('JJS', ['uppermost']), Tree('NN', ['part'])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('DT', ['the']), Tree('NN', ['plate']), Tree('NN', ['interface'])])])]), Tree(',', [',']), Tree('UCP', [Tree('RB', ['as']), Tree('S', [Tree('VP', [Tree('VBZ', ['is']), Tree('VP', [Tree('VBN', ['thought']), Tree('S', [Tree('VP', [Tree('TO', ['to']), Tree('VP', [Tree('VB', ['be']), Tree('ADJP', [Tree('JJ', ['present']), Tree('PP', [Tree('IN', ['in']), Tree('NP', [Tree('NP', [Tree('NNS', ['areas'])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('JJ', ['significant']), Tree('NN', ['topography'])])])])])]), Tree('PP', [Tree('IN', ['at']), Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('NN', ['top'])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['oceanic']), Tree('NN', ['crust'])])])])])])])])])])]), Tree(',', [',']), Tree('CC', ['and']), Tree('SBAR', [Tree('WHADVP', [Tree('WRB', ['where'])]), Tree('S', [Tree('NP', [Tree('NN', ['propagation'])]), Tree('VP', [Tree('VBD', ['was']), Tree('PP', [Tree('IN', ['in']), Tree('NP', [Tree('DT', ['an']), Tree('JJ', ['up-dip']), Tree('NN', ['direction'])])]), Tree(',', [',']), Tree('ADVP', [Tree('RB', ['possibly'])])])])])])])])]), Tree('S', [Tree('VP', [Tree('VBG', ['reaching']), Tree('NP', [Tree('DT', ['the']), Tree('NN', ['seafloor'])])])])])])])])])])])])])]), Tree('.', ['.'])])])

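My question is: given parsed_tree, how can I get entities such as "top of the oceanic crust" and "a thin layer"?

I thought the levels of the parsed tree might be useful, but I got lost when I looked at them and I don't know how to proceed.

I work mainly in Python. The Stanford NLP results were obtained with a Python wrapper ( https://bitbucket.org/torotoki/corenlp-python ).

Can anyone help, or at least point me in the right direction?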

You can try extracting the subtrees labeled NP:

>>> from nltk import Tree
>>> parsed_tree = Tree('ROOT', [Tree('S', [Tree('NP', [Tree('NN', ['Tsunami']), Tree('NNS', ['earthquakes'])]), Tree('VP', [Tree('VBP', ['have']), Tree('ADVP', [Tree('RB', ['also'])]), Tree('VP', [Tree('VBN', ['been']), Tree('VP', [Tree('VBN', ['linked']), Tree('PP', [Tree('TO', ['to']), Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('NN', ['presence'])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('NP', [Tree('DT', ['a']), Tree('JJ', ['thin']), Tree('NN', ['layer'])]), Tree('PP', [Tree('IN', ['of']), Tree('S', [Tree('VP', [Tree('VBN', ['subducted']), Tree('NP', [Tree('NP', [Tree('JJ', ['sedimentary']), Tree('NN', ['rock'])]), Tree('PP', [Tree('IN', ['along']), Tree('NP', [Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('JJS', ['uppermost']), Tree('NN', ['part'])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('DT', ['the']), Tree('NN', ['plate']), Tree('NN', ['interface'])])])]), Tree(',', [',']), Tree('UCP', [Tree('RB', ['as']), Tree('S', [Tree('VP', [Tree('VBZ', ['is']), Tree('VP', [Tree('VBN', ['thought']), Tree('S', [Tree('VP', [Tree('TO', ['to']), Tree('VP', [Tree('VB', ['be']), Tree('ADJP', [Tree('JJ', ['present']), Tree('PP', [Tree('IN', ['in']), Tree('NP', [Tree('NP', [Tree('NNS', ['areas'])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('JJ', ['significant']), Tree('NN', ['topography'])])])])])]), Tree('PP', [Tree('IN', ['at']), Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('NN', ['top'])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['oceanic']), Tree('NN', ['crust'])])])])])])])])])])]), Tree(',', [',']), Tree('CC', ['and']), Tree('SBAR', [Tree('WHADVP', [Tree('WRB', ['where'])]), Tree('S', [Tree('NP', [Tree('NN', ['propagation'])]), Tree('VP', [Tree('VBD', ['was']), Tree('PP', [Tree('IN', ['in']), Tree('NP', [Tree('DT', ['an']), Tree('JJ', ['up-dip']), Tree('NN', ['direction'])])]), Tree(',', [',']), Tree('ADVP', [Tree('RB', ['possibly'])])])])])])])])]), Tree('S', [Tree('VP', [Tree('VBG', 
['reaching']), Tree('NP', [Tree('DT', ['the']), Tree('NN', ['seafloor'])])])])])])])])])])])])])]), Tree('.', ['.'])])])

>>> np = [" ".join(i.leaves()) for i in parsed_tree.subtrees() if i.label() == 'NP']
>>> np
['Tsunami earthquakes', 'the presence of a thin layer of subducted sedimentary rock along the uppermost part of the plate interface , as is thought to be present in areas of significant topography at the top of the oceanic crust , and where propagation was in an up-dip direction , possibly reaching the seafloor', 'the presence', 'a thin layer of subducted sedimentary rock along the uppermost part of the plate interface , as is thought to be present in areas of significant topography at the top of the oceanic crust , and where propagation was in an up-dip direction , possibly reaching the seafloor', 'a thin layer', 'sedimentary rock along the uppermost part of the plate interface , as is thought to be present in areas of significant topography at the top of the oceanic crust , and where propagation was in an up-dip direction , possibly', 'sedimentary rock', 'the uppermost part of the plate interface , as is thought to be present in areas of significant topography at the top of the oceanic crust , and where propagation was in an up-dip direction , possibly', 'the uppermost part of the plate interface', 'the uppermost part', 'the plate interface', 'areas of significant topography', 'areas', 'significant topography', 'the top of the oceanic crust', 'the top', 'the oceanic crust', 'propagation', 'an up-dip direction', 'the seafloor']
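To see why the list contains so many overlapping phrases, it can help to run the same comprehension on a small fragment. The sketch below is only illustrative: `fragment` is a hand-copied piece of the parse above (not separate parser output), and it assumes `Tree.fromstring`, `subtrees()`, `label()` and `leaves()` behave as they do in NLTK 3.

```python
from nltk import Tree

# Hand-copied fragment of the parse above (an illustrative assumption,
# not separate parser output).
fragment = Tree.fromstring(
    "(NP (NP (DT the) (NN top)) "
    "(PP (IN of) (NP (DT the) (JJ oceanic) (NN crust))))"
)

# subtrees() walks the tree top-down and yields the tree itself first,
# so an outer NP is listed before the NPs nested inside it.
noun_phrases = [" ".join(t.leaves()) for t in fragment.subtrees()
                if t.label() == "NP"]
print(noun_phrases)
# ['the top of the oceanic crust', 'the top', 'the oceanic crust']
```

Every nested NP shows up once on its own and once inside each NP that encloses it, which is where the noise comes from.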

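But that gives a lot of noise, so let's say that a single word is not a phrase:

>>> np_mwe = [j for j in [" ".join(i.leaves()) for i in parsed_tree.subtrees() if i.label() == 'NP'] if j.count(' ') > 0]
>>> np_mwe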
['Tsunami earthquakes', 'the presence of a thin layer of subducted sedimentary rock along the uppermost part of the plate interface , as is thought to be present in areas of significant topography at the top of the oceanic crust , and where propagation was in an up-dip direction , possibly reaching the seafloor', 'the presence', 'a thin layer of subducted sedimentary rock along the uppermost part of the plate interface , as is thought to be present in areas of significant topography at the top of the oceanic crust , and where propagation was in an up-dip direction , possibly reaching the seafloor', 'a thin layer', 'sedimentary rock along the uppermost part of the plate interface , as is thought to be present in areas of significant topography at the top of the oceanic crust , and where propagation was in an up-dip direction , possibly', 'sedimentary rock', 'the uppermost part of the plate interface , as is thought to be present in areas of significant topography at the top of the oceanic crust , and where propagation was in an up-dip direction , possibly', 'the uppermost part of the plate interface', 'the uppermost part', 'the plate interface', 'areas of significant topography', 'significant topography', 'the top of the oceanic crust', 'the top', 'the oceanic crust', 'an up-dip direction', 'the seafloor']

Still noisy. Let's assume a noun phrase should not contain a comma (not necessarily true, but a useful trick):

>>> np_mwe_nocomma = [j for j in [" ".join(i.leaves()) for i in parsed_tree.subtrees() if i.label() == 'NP'] if j.count(' ') > 0 and j.count(',') == 0]
>>> np_mwe_nocomma
['Tsunami earthquakes', 'the presence', 'a thin layer', 'sedimentary rock', 'the uppermost part of the plate interface', 'the uppermost part', 'the plate interface', 'areas of significant topography', 'significant topography', 'the top of the oceanic crust', 'the top', 'the oceanic crust', 'an up-dip direction', 'the seafloor']

Now it is easy to see phrases nested inside other phrases, so let's keep only the larger ones:

>>> x = []
>>> for i in sorted(np_mwe_nocomma, key=len, reverse=True):
...     # skip a phrase that is already contained in a longer one we kept
...     if any(i in j for j in x):
...         continue
...     print(i)
...     x.append(i)
... 
the uppermost part of the plate interface
areas of significant topography
the top of the oceanic crust
Tsunami earthquakes
an up-dip direction
sedimentary rock
the presence
a thin layer
the seafloor
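The selection of larger phrases can also be wrapped in a small function. This is a minimal sketch: the name `keep_maximal` and the sample list `candidates` are made up for illustration, and the function works on plain strings, so it does not need NLTK.

```python
def keep_maximal(phrases):
    """Return the phrases that are not substrings of a longer kept phrase."""
    kept = []
    # Longest first, so each phrase is only compared against
    # candidates that are at least as long as itself.
    for phrase in sorted(phrases, key=len, reverse=True):
        if not any(phrase in longer for longer in kept):
            kept.append(phrase)
    return kept


# Hypothetical sample input, taken from the phrases listed earlier.
candidates = ['the top of the oceanic crust', 'the top',
              'the oceanic crust', 'a thin layer']
print(keep_maximal(candidates))
# ['the top of the oceanic crust', 'a thin layer']
```

Note that the test is a character-level substring check, so 'the top' would also be dropped next to an unrelated phrase such as 'the topography of the ridge'. Comparing the subtrees themselves rather than their joined strings would avoid that.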

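I'm not sure whether this meets your needs, but your definition of "entity" needs to be more specific. Otherwise almost any NP tagged by the parser could be an "entity".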
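As one example of a more specific definition, an "entity" could be taken to mean an innermost NP, that is, an NP with no other NP nested inside it. The sketch below assumes that definition. `innermost_nps` is a hypothetical helper, not an NLTK function, and `fragment` is a hand-copied piece of the parse above.

```python
from nltk import Tree


def innermost_nps(tree):
    """Yield NP subtrees that contain no other NP (hypothetical helper)."""
    for t in tree.subtrees(lambda s: s.label() == "NP"):
        # t.subtrees() includes t itself, so a count of exactly one NP
        # means there is no NP nested below t.
        if sum(1 for s in t.subtrees() if s.label() == "NP") == 1:
            yield t


# Hand-copied fragment of the parse above (illustrative assumption).
fragment = Tree.fromstring(
    "(NP (NP (DT the) (NN top)) "
    "(PP (IN of) (NP (DT the) (JJ oceanic) (NN crust))))"
)
base = [" ".join(t.leaves()) for t in innermost_nps(fragment)]
print(base)
# ['the top', 'the oceanic crust']
```

Under this definition "the top of the oceanic crust" is split in two, whereas keeping only the largest phrases preserves it whole. Which behavior is right depends entirely on what you count as an entity.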

