
Separate NLTK subtree based on label

I have an NLTK parse tree, and I want to separate the tree's leaves based only on the "S" labels. Note that the "S" subtrees should not share leaves.

Given the sentence "He won the Gusher Marathon, finishing in 30 minutes."

The tree from CoreNLP is:

tree = '''(S
  (NP (PRP He))
  (VP
    (VBD won)
    (NP (DT the) (NNP Gusher) (NNP Marathon))
    (, ,)
    (S (VP (VBG finishing) (PP (IN in) (NP (CD 30) (NNS minutes))))))
  (. .))'''

The idea is to extract the two "S" subtrees and their leaves without any overlap between them. So the expected output should be "He won the Gusher Marathon ,." and "finishing in 30 minutes."

from nltk import Tree

# Tree manipulation

# Extract phrases from a parsed (chunked) tree
# Phrase = tag for the string phrase (sub-tree) to extract
# Returns: List of deep copies;  Recursive
def ExtractPhrases( myTree, phrase):
    myPhrases = []
    if (myTree.label() == phrase):
        myPhrases.append( myTree.copy(True) )
    for child in myTree:
        if (type(child) is Tree):
            list_of_phrases = ExtractPhrases(child, phrase)
            if (len(list_of_phrases) > 0):
                myPhrases.extend(list_of_phrases)
    return myPhrases
subtexts = set()
sep_tree = ExtractPhrases( Tree.fromstring(tree), 'S')
for sep in sep_tree:
    for subtree in sep.subtrees():
        if subtree.label()=="S":
            print(subtree)
            subtexts.add(' '.join(subtree.leaves()))
            #break

subtexts = list(subtexts)
print(subtexts)

I got the output:

['He won the Gusher Marathon , finishing in 30 minutes .', 'finishing in 30 minutes']

I don't want to manipulate it at the string level but at the tree level, so the expected output would be:

["He won the Gusher Marathon ,.",  "finishing in 30 minutes."]

This is my sample input:

a = '''

FREEDOM FROM RELIGION FOUNDATION

Darwin fish bumper stickers and assorted other atheist paraphernalia are
available from the Freedom From Religion Foundation in the US.

EVOLUTION DESIGNS

Evolution Designs sell the "Darwin fish".  It's a fish symbol, like the ones
Christians stick on their cars, but with feet and the word "Darwin" written
inside.  The deluxe moulded 3D plastic fish is $4.95 postpaid in the US.

'''


    import nltk

    sentences = nltk.sent_tokenize(a)
    sentences = [nltk.word_tokenize(sent) for sent in sentences]
    tagged_sentences = nltk.pos_tag_sents(sentences)
    chunked_sentences = list(nltk.ne_chunk_sents(tagged_sentences))

    for sent in chunked_sentences:
        for subtree in sent.subtrees(filter=lambda t: t.label() == 'S'):
            print(subtree)

Here is my output:

(S
  (ORGANIZATION FREEDOM/NN)
  (ORGANIZATION FROM/NNP)
  RELIGION/NNP
  FOUNDATION/NNP
  Darwin/NNP
  fish/JJ
  bumper/NN
  stickers/NNS
  and/CC
  assorted/VBD
  other/JJ
  atheist/JJ
  paraphernalia/NNS
  are/VBP
  available/JJ
  from/IN
  the/DT
  (ORGANIZATION Freedom/NN From/NNP Religion/NNP Foundation/NNP)
  in/IN
  the/DT
  (GSP US/NNP)
  ./.)

(S
  (ORGANIZATION EVOLUTION/NNP)
  (ORGANIZATION DESIGNS/NNP Evolution/NNP)
  Designs/NNP
  sell/VB
  the/DT
  ``/``
  (PERSON Darwin/NNP)
  fish/NN
  ''/''
  ./.)

(S
  It/PRP
  's/VBZ
  a/DT
  fish/JJ
  symbol/NN
  ,/,
  like/IN
  the/DT
  ones/NNS
  Christians/NNPS
  stick/VBP
  on/IN
  their/PRP$
  cars/NNS
  ,/,
  but/CC
  with/IN
  feet/NNS
  and/CC
  the/DT
  word/NN
  ``/``
  (PERSON Darwin/NNP)
  ''/''
  written/VBN
  inside/RB
  ./.)

(S
  The/DT
  deluxe/NN
  moulded/VBD
  3D/CD
  plastic/JJ
  fish/NN
  is/VBZ
  $/$
  4.95/CD
  postpaid/NN
  in/IN
  the/DT
  (GSP US/NNP)
  ./.)


 