
Separate NLTK subtree based on label

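I have an NLTK parse tree, and I want to separate the leaves of the tree based only on the "S" label. Note that the S subtrees should not overlap in their leaves.

Given the sentence "He won the Gusher Marathon, finishing in 30 minutes."

The tree from CoreNLP is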

tree = '''(S
  (NP (PRP He))
  (VP
    (VBD won)
    (NP (DT the) (NNP Gusher) (NNP Marathon))
    (, ,)
    (S (VP (VBG finishing) (PP (IN in) (NP (CD 30) (NNS minutes))))))
  (. .))'''

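The idea is to extract the 2 "S" subtrees and their leaves, but without them overlapping each other. So the expected output should be "He won the Gusher Marathon ,." and "finishing in 30 minutes."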

# Tree manipulation
from nltk import Tree

# Extract phrases from a parsed (chunked) tree.
# phrase: label of the sub-trees to extract.
# Returns a list of deep copies; recurses through the whole tree.
def ExtractPhrases(myTree, phrase):
    myPhrases = []
    if myTree.label() == phrase:
        myPhrases.append(myTree.copy(True))
    for child in myTree:
        if isinstance(child, Tree):
            myPhrases.extend(ExtractPhrases(child, phrase))
    return myPhrases

subtexts = set()
sep_tree = ExtractPhrases(Tree.fromstring(tree), 'S')
for sep in sep_tree:
    for subtree in sep.subtrees():
        if subtree.label() == "S":
            print(subtree)
            subtexts.add(' '.join(subtree.leaves()))
            #break

subtexts = list(subtexts)
print(subtexts)

I get the output

['He won the Gusher Marathon , finishing in 30 minutes .', 'finishing in 30 minutes']

I don't want to manipulate this at the string level but at the tree level, so the expected output would be:

["He won the Gusher Marathon ,.",  "finishing in 30 minutes."]
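One possible tree-level approach, given as a minimal sketch rather than a definitive solution: for every subtree labelled "S", collect its leaves but skip any nested subtree that carries the same label. The helper names `leaves_outside_nested` and `split_by_label` are hypothetical and introduced here for illustration only; the sketch relies solely on `Tree.fromstring`, `Tree.label()` and `Tree.subtrees()` from NLTK.

```python
from nltk import Tree

def leaves_outside_nested(tree, label):
    """Return the leaves of `tree`, skipping nested subtrees labelled `label`."""
    leaves = []
    for child in tree:
        if isinstance(child, Tree):
            if child.label() == label:
                continue  # this span belongs to the nested subtree, not to us
            leaves.extend(leaves_outside_nested(child, label))
        else:
            leaves.append(child)  # a plain token
    return leaves

def split_by_label(tree, label="S"):
    """One space-joined string per `label` subtree, with no shared leaves."""
    return [
        " ".join(leaves_outside_nested(subtree, label))
        for subtree in tree.subtrees(lambda t: t.label() == label)
    ]

parsed = Tree.fromstring(
    "(S (NP (PRP He)) (VP (VBD won) (NP (DT the) (NNP Gusher) (NNP Marathon))"
    " (, ,) (S (VP (VBG finishing) (PP (IN in) (NP (CD 30) (NNS minutes))))))"
    " (. .))"
)
print(split_by_label(parsed))
```

With this sketch the final period stays with the outer "S" only, because it is a leaf of the outer subtree; appending it to the inner clause as well, as in the expected output above, would need an extra step.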

Here is my sample input:

a = '''

FREEDOM FROM RELIGION FOUNDATION

Darwin fish bumper stickers and assorted other atheist paraphernalia are
available from the Freedom From Religion Foundation in the US.

EVOLUTION DESIGNS

Evolution Designs sell the "Darwin fish".  It's a fish symbol, like the ones
Christians stick on their cars, but with feet and the word "Darwin" written
inside.  The deluxe moulded 3D plastic fish is $4.95 postpaid in the US.

'''


import nltk

sentences = nltk.sent_tokenize(a)
sentences = [nltk.word_tokenize(sent) for sent in sentences]
tagged_sentences = nltk.pos_tag_sents(sentences)
chunked_sentences = list(nltk.ne_chunk_sents(tagged_sentences))

for sent in chunked_sentences:
    for subtree in sent.subtrees(filter=lambda t: t.label() == 'S'):
        print(subtree)

Here is my output:

(S
  (ORGANIZATION FREEDOM/NN)
  (ORGANIZATION FROM/NNP)
  RELIGION/NNP
  FOUNDATION/NNP
  Darwin/NNP
  fish/JJ
  bumper/NN
  stickers/NNS
  and/CC
  assorted/VBD
  other/JJ
  atheist/JJ
  paraphernalia/NNS
  are/VBP
  available/JJ
  from/IN
  the/DT
  (ORGANIZATION Freedom/NN From/NNP Religion/NNP Foundation/NNP)
  in/IN
  the/DT
  (GSP US/NNP)
  ./.)

(S
  (ORGANIZATION EVOLUTION/NNP)
  (ORGANIZATION DESIGNS/NNP Evolution/NNP)
  Designs/NNP
  sell/VB
  the/DT
  ``/``
  (PERSON Darwin/NNP)
  fish/NN
  ''/''
  ./.)

(S
  It/PRP
  's/VBZ
  a/DT
  fish/JJ
  symbol/NN
  ,/,
  like/IN
  the/DT
  ones/NNS
  Christians/NNPS
  stick/VBP
  on/IN
  their/PRP$
  cars/NNS
  ,/,
  but/CC
  with/IN
  feet/NNS
  and/CC
  the/DT
  word/NN
  ``/``
  (PERSON Darwin/NNP)
  ''/''
  written/VBN
  inside/RB
  ./.)

(S
  The/DT
  deluxe/NN
  moulded/VBD
  3D/CD
  plastic/JJ
  fish/NN
  is/VBZ
  $/$
  4.95/CD
  postpaid/NN
  in/IN
  the/DT
  (GSP US/NNP)
  ./.)
