
Separate NLTK subtree based on label

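I have an NLTK parse tree, and I want to separate the leaves of the tree based only on the "S" label. Note that the leaves of one S should not overlap with the leaves of another.

Given the sentence "He won the Gusher Marathon, finishing in 30 minutes."

The tree from CoreNLP is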

tree = '''(S
  (NP (PRP He))
  (VP
    (VBD won)
    (NP (DT the) (NNP Gusher) (NNP Marathon))
    (, ,)
    (S (VP (VBG finishing) (PP (IN in) (NP (CD 30) (NNS minutes))))))
  (. .))'''

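The idea is to extract the 2 "S" subtrees and their leaves, without any overlap between them. So the expected output should be "He won the Gusher Marathon ,." and "finishing in 30 minutes."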

# Tree manipulation
from nltk.tree import Tree

# Extract phrases from a parsed (chunked) tree
# phrase = label of the sub-trees to extract
# Returns: list of deep copies, collected recursively
def ExtractPhrases(myTree, phrase):
    myPhrases = []
    if myTree.label() == phrase:
        myPhrases.append(myTree.copy(True))
    for child in myTree:
        if isinstance(child, Tree):
            list_of_phrases = ExtractPhrases(child, phrase)
            if len(list_of_phrases) > 0:
                myPhrases.extend(list_of_phrases)
    return myPhrases

subtexts = set()
sep_tree = ExtractPhrases(Tree.fromstring(tree), 'S')
for sep in sep_tree:
    for subtree in sep.subtrees():
        if subtree.label() == "S":
            print(subtree)
            subtexts.add(' '.join(subtree.leaves()))
            #break

subtexts = list(subtexts)
print(subtexts)

I get the output

['He won the Gusher Marathon , finishing in 30 minutes .', 'finishing in 30 minutes']

I don't want to manipulate it at the string level but at the tree level, so the expected output would be:

["He won the Gusher Marathon ,.",  "finishing in 30 minutes."]
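One possible way to do this at the tree level is to walk each "S" subtree and skip any nested subtree that carries the same label, so that every leaf is attributed to its nearest enclosing "S" only. The following is a minimal sketch under that assumption; the helper names `leaves_outside_nested` and `split_by_label` are made up for illustration and are not part of NLTK.

```python
from nltk.tree import Tree

EXAMPLE = """(S
  (NP (PRP He))
  (VP
    (VBD won)
    (NP (DT the) (NNP Gusher) (NNP Marathon))
    (, ,)
    (S (VP (VBG finishing) (PP (IN in) (NP (CD 30) (NNS minutes))))))
  (. .))"""

def leaves_outside_nested(node, label):
    """Collect the leaves under `node`, skipping descendants labelled `label`."""
    words = []
    for child in node:
        if isinstance(child, Tree):
            if child.label() == label:
                continue  # nested clause: its leaves are reported separately
            words.extend(leaves_outside_nested(child, label))
        else:
            words.append(child)  # plain leaf (a token string)
    return words

def split_by_label(tree, label="S"):
    """Return one string per `label` subtree, with no leaf shared between them."""
    return [
        " ".join(leaves_outside_nested(sub, label))
        for sub in tree.subtrees(lambda t: t.label() == label)
    ]

result = split_by_label(Tree.fromstring(EXAMPLE), "S")
print(result)
```

On the example tree this yields "He won the Gusher Marathon , ." and "finishing in 30 minutes". In this parse the final period is a child of the outer S, so it shows up only in the first string; attaching it to the inner clause as well would need an extra rule that the tree itself does not express.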

Here is my sample input:

a = '''

FREEDOM FROM RELIGION FOUNDATION

Darwin fish bumper stickers and assorted other atheist paraphernalia are
available from the Freedom From Religion Foundation in the US.

EVOLUTION DESIGNS

Evolution Designs sell the "Darwin fish".  It's a fish symbol, like the ones
Christians stick on their cars, but with feet and the word "Darwin" written
inside.  The deluxe moulded 3D plastic fish is $4.95 postpaid in the US.

'''


import nltk

sentences = nltk.sent_tokenize(a)
sentences = [nltk.word_tokenize(sent) for sent in sentences]
tagged_sentences = nltk.pos_tag_sents(sentences)
chunked_sentences = list(nltk.ne_chunk_sents(tagged_sentences))

for sent in chunked_sentences:
    # 'S' labels only the root of each chunked sentence
    for subtree in sent.subtrees(filter=lambda t: t.label() == 'S'):
        print(subtree)

This is my output:

(S
  (ORGANIZATION FREEDOM/NN)
  (ORGANIZATION FROM/NNP)
  RELIGION/NNP
  FOUNDATION/NNP
  Darwin/NNP
  fish/JJ
  bumper/NN
  stickers/NNS
  and/CC
  assorted/VBD
  other/JJ
  atheist/JJ
  paraphernalia/NNS
  are/VBP
  available/JJ
  from/IN
  the/DT
  (ORGANIZATION Freedom/NN From/NNP Religion/NNP Foundation/NNP)
  in/IN
  the/DT
  (GSP US/NNP)
  ./.)

(S
  (ORGANIZATION EVOLUTION/NNP)
  (ORGANIZATION DESIGNS/NNP Evolution/NNP)
  Designs/NNP
  sell/VB
  the/DT
  ``/``
  (PERSON Darwin/NNP)
  fish/NN
  ''/''
  ./.)

(S
  It/PRP
  's/VBZ
  a/DT
  fish/JJ
  symbol/NN
  ,/,
  like/IN
  the/DT
  ones/NNS
  Christians/NNPS
  stick/VBP
  on/IN
  their/PRP$
  cars/NNS
  ,/,
  but/CC
  with/IN
  feet/NNS
  and/CC
  the/DT
  word/NN
  ``/``
  (PERSON Darwin/NNP)
  ''/''
  written/VBN
  inside/RB
  ./.)

(S
  The/DT
  deluxe/NN
  moulded/VBD
  3D/CD
  plastic/JJ
  fish/NN
  is/VBZ
  $/$
  4.95/CD
  postpaid/NN
  in/IN
  the/DT
  (GSP US/NNP)
  ./.)
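The output above contains whole sentences because, in the trees produced by `nltk.ne_chunk_sents`, the "S" label marks only the root of each sentence, while the named-entity chunks (ORGANIZATION, PERSON, GSP) sit underneath it. If the aim is to pull out those chunks rather than the sentences (an assumption, since the goal is not stated here), filtering on any label other than "S" is one way to do it. The sketch below uses a hand-built tree copied from the last sentence of the output so that it runs without downloading the tagger and chunker models; `named_entities` is a hypothetical helper name.

```python
from nltk.tree import Tree

def named_entities(chunked_sentence):
    """Return (label, text) pairs for every chunk below the sentence root."""
    entities = []
    for subtree in chunked_sentence.subtrees(lambda t: t.label() != "S"):
        # Leaves of a chunked sentence are (word, pos_tag) tuples
        text = " ".join(word for word, tag in subtree.leaves())
        entities.append((subtree.label(), text))
    return entities

# Hand-built stand-in for one ne_chunk result, copied from the output above
sample = Tree("S", [
    ("The", "DT"), ("deluxe", "NN"), ("moulded", "VBD"), ("3D", "CD"),
    ("plastic", "JJ"), ("fish", "NN"), ("is", "VBZ"), ("$", "$"),
    ("4.95", "CD"), ("postpaid", "NN"), ("in", "IN"), ("the", "DT"),
    Tree("GSP", [("US", "NNP")]), (".", "."),
])

entities = named_entities(sample)
print(entities)
```

For the sample sentence this gives a single pair, the GSP chunk "US". With real data, the same function could be applied to each item of `chunked_sentences`.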

