How do I identify whether an object is an NLTK Tree and then parse it?
I am trying to get GPE locations from a message after tokenizing it.
from nltk import ne_chunk
print(ne_chunk(pos_words[0]))
Output:
(S
  Weather/NNP
  update/VB
  a/DT
  cold/JJ
  front/NN
  from/IN
  (GPE Cuba/NNP)
  that/WDT
  could/MD
  pass/VB
  over/RP
  (PERSON Haiti/NNP))
I want to get Cuba from this output as a string. How can I access it?
Edit: I am trying to generalize the extraction to a dataframe by building a list of locations. This is the function I wrote. However, it splits multi-word locations such as New York into [New, York]:
from nltk import Tree, ne_chunk

def get_locations(pos_words):  # function name assumed; the original def line was not shown
    locations = []
    for i in range(len(pos_words)):
        chunks = ne_chunk(pos_words[i])
        for c in chunks:
            if isinstance(c, Tree) and c.label() == 'GPE':
                # The object is <class 'nltk.tree.Tree'> and label is Geopolitical Entity
                locations.extend([w for w, _ in c.leaves()])
    return locations
import nltk
from nltk import Tree

text = 'Weather update a cold front from Cuba that could pass over Hatti'

# Tokenize and tag
pos_words = nltk.pos_tag(nltk.word_tokenize(text))

# Named entity chunker
chunks = nltk.ne_chunk(pos_words)

for c in chunks:
    if isinstance(c, Tree) and c.label() == 'GPE':
        # The object is <class 'nltk.tree.Tree'> and label is Geopolitical Entity
        print(' '.join([w for w, _ in c.leaves()]))
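For the multi-word case in the edit, the split most likely comes from extend(), which adds every leaf word of a chunk as its own list item. Below is a minimal sketch of one way around it, not a definitive implementation. The helper name gpe_names is hypothetical, and it assumes the sentence has already been chunked. The sample tree is built by hand to stand in for what nltk.ne_chunk would return, so the snippet runs without downloading any NLTK models.

```python
from nltk import Tree


def gpe_names(chunked):
    """Return each GPE chunk in a chunked sentence as one joined string."""
    names = []
    for node in chunked:
        # Plain tokens are (word, tag) tuples; named entities are Tree subtrees.
        if isinstance(node, Tree) and node.label() == 'GPE':
            # Join the words of one chunk so 'New York' stays a single entry.
            names.append(' '.join(word for word, _tag in node.leaves()))
    return names


# Hand-built tree standing in for the result of nltk.ne_chunk(...)
sample = Tree('S', [
    ('Flights', 'NNS'),
    ('from', 'IN'),
    Tree('GPE', [('New', 'NNP'), ('York', 'NNP')]),
    ('to', 'TO'),
    Tree('GPE', [('Cuba', 'NNP')]),
])

print(gpe_names(sample))  # ['New York', 'Cuba']
```

To build the list for a whole dataframe column, something like locations.extend(gpe_names(ne_chunk(s))) for each tagged sentence s in pos_words should work. Here extend() is safe because each item being added is already a complete location name.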