Separate NLTK subtree based on label
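I have an NLTK parse tree, and I want to separate the tree's leaves based only on the "S" label. Note that the S subtrees must not overlap in their leaves.
Given the sentence "He won the Gusher Marathon, finishing in 30 minutes."
the tree produced by CoreNLP is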
tree = '''(S
  (NP (PRP He))
  (VP
    (VBD won)
    (NP (DT the) (NNP Gusher) (NNP Marathon))
    (, ,)
    (S (VP (VBG finishing) (PP (IN in) (NP (CD 30) (NNS minutes))))))
  (. .))'''
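The idea is to extract the two "S" subtrees and their leaves, without any overlap between them. So the expected output should be "He won the Gusher Marathon ," and "finishing in 30 minutes".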
from nltk.tree import Tree

# Tree manipulation
# Extract phrases from a parsed (chunked) tree
# phrase = label of the phrase (sub-tree) to extract
# Returns: list of deep copies; recursive
def ExtractPhrases(myTree, phrase):
    myPhrases = []
    if myTree.label() == phrase:
        myPhrases.append(myTree.copy(True))
    for child in myTree:
        if isinstance(child, Tree):
            myPhrases.extend(ExtractPhrases(child, phrase))
    return myPhrases

subtexts = set()
sep_tree = ExtractPhrases(Tree.fromstring(tree), 'S')
for sep in sep_tree:
    for subtree in sep.subtrees():
        if subtree.label() == "S":
            print(subtree)
            subtexts.add(' '.join(subtree.leaves()))
            #break
subtexts = list(subtexts)
print(subtexts)
The output I get is
['He won the Gusher Marathon , finishing in 30 minutes .', 'finishing in 30 minutes']
I don't want to manipulate this at the string level but at the tree level, so the expected output would be:
["He won the Gusher Marathon ,.", "finishing in 30 minutes."]
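One way to do this at the tree level, offered as a minimal sketch rather than a definitive answer: for each subtree labeled S, walk its children and collect leaves, but stop descending whenever a nested S is reached, since that clause is reported on its own. The helper names `leaves_without_nested` and `split_by_label` are made up for illustration; `Tree.fromstring` and `Tree.subtrees` are the only NLTK calls used.

```python
from nltk.tree import Tree


def leaves_without_nested(tree, label="S"):
    """Collect the leaves under `tree`, skipping any descendant labeled `label`."""
    words = []
    for child in tree:
        if isinstance(child, Tree):
            if child.label() == label:
                continue  # a nested clause is reported separately
            words.extend(leaves_without_nested(child, label))
        else:
            words.append(child)  # a leaf (plain string token)
    return words


def split_by_label(tree, label="S"):
    """Return one leaf list per `label` subtree, with no leaf shared between lists."""
    return [
        leaves_without_nested(subtree, label)
        for subtree in tree.subtrees(lambda t: t.label() == label)
    ]


parse = Tree.fromstring("""(S
  (NP (PRP He))
  (VP
    (VBD won)
    (NP (DT the) (NNP Gusher) (NNP Marathon))
    (, ,)
    (S (VP (VBG finishing) (PP (IN in) (NP (CD 30) (NNS minutes))))))
  (. .))""")

for words in split_by_label(parse):
    print(" ".join(words))
```

With the tree from the question, this prints `He won the Gusher Marathon , .` and `finishing in 30 minutes`. The final period stays with the outer S because it is a direct child of that node, so the second string does not end in a period as in the expected output above; attaching it there would need an extra rule.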
Here is my sample input:
import nltk

a = '''
FREEDOM FROM RELIGION FOUNDATION
Darwin fish bumper stickers and assorted other atheist paraphernalia are
available from the Freedom From Religion Foundation in the US.
EVOLUTION DESIGNS
Evolution Designs sell the "Darwin fish". It's a fish symbol, like the ones
Christians stick on their cars, but with feet and the word "Darwin" written
inside. The deluxe moulded 3D plastic fish is $4.95 postpaid in the US.
'''
sentences = nltk.sent_tokenize(a)
sentences = [nltk.word_tokenize(sent) for sent in sentences]
tagged_sentences = nltk.pos_tag_sents(sentences)
chunked_sentences = list(nltk.ne_chunk_sents(tagged_sentences))
for sent in chunked_sentences:
    for subtree in sent.subtrees(filter=lambda t: t.label() == 'S'):
        print(subtree)
This is my output:
(S
  (ORGANIZATION FREEDOM/NN)
  (ORGANIZATION FROM/NNP)
  RELIGION/NNP
  FOUNDATION/NNP
  Darwin/NNP
  fish/JJ
  bumper/NN
  stickers/NNS
  and/CC
  assorted/VBD
  other/JJ
  atheist/JJ
  paraphernalia/NNS
  are/VBP
  available/JJ
  from/IN
  the/DT
  (ORGANIZATION Freedom/NN From/NNP Religion/NNP Foundation/NNP)
  in/IN
  the/DT
  (GSP US/NNP)
  ./.)
(S
  (ORGANIZATION EVOLUTION/NNP)
  (ORGANIZATION DESIGNS/NNP Evolution/NNP)
  Designs/NNP
  sell/VB
  the/DT
  ``/``
  (PERSON Darwin/NNP)
  fish/NN
  ''/''
  ./.)
(S
  It/PRP
  's/VBZ
  a/DT
  fish/JJ
  symbol/NN
  ,/,
  like/IN
  the/DT
  ones/NNS
  Christians/NNPS
  stick/VBP
  on/IN
  their/PRP$
  cars/NNS
  ,/,
  but/CC
  with/IN
  feet/NNS
  and/CC
  the/DT
  word/NN
  ``/``
  (PERSON Darwin/NNP)
  ''/''
  written/VBN
  inside/RB
  ./.)
(S
  The/DT
  deluxe/NN
  moulded/VBD
  3D/CD
  plastic/JJ
  fish/NN
  is/VBZ
  $/$
  4.95/CD
  postpaid/NN
  in/IN
  the/DT
  (GSP US/NNP)
  ./.)