How to extract noun-based compound words from a sentence using Python?
I am using nltk to extract nouns from a sentence with the following code:
import nltk

words = nltk.word_tokenize(sentence)  # sentence is the input string
tags = nltk.pos_tag(words)
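Then I select the words tagged with the NN and NNP part-of-speech (PoS) tags. However, this only extracts single nouns such as "book" and "table" and ignores noun pairs such as "basketball shoe". What should I do to extend the results to include such compound noun pairs?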
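Assuming you only want to find noun-noun compounds (e.g. "book store") and not other combinations such as noun-verb (e.g. "snowfall") or adj-noun (e.g. "hot dog"), the following solution captures two or more consecutive occurrences of the NN, NNS, NNP, or NNPS part-of-speech (PoS) tags.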
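To make the "two or more consecutive noun tags" idea concrete, here is a minimal sketch in plain Python that does not use NLTK's chunker. The (token, tag) pairs are written by hand and only assumed to resemble what pos_tag would return for the example sentence; `consecutive_nouns` and `NOUN_TAGS` are hypothetical names used for illustration.

```python
from itertools import groupby

# Hand-written (token, tag) pairs standing in for pos_tag output.
# The tags are an assumption, not the result of running a tagger.
tagged = [
    ("John", "NNP"), ("lost", "VBD"), ("his", "PRP$"),
    ("basketball", "NN"), ("shoe", "NN"), ("in", "IN"),
    ("the", "DT"), ("book", "NN"), ("store", "NN"),
    ("while", "IN"), ("eating", "VBG"),
    ("peanut", "NN"), ("butter", "NN"),
]

NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def consecutive_nouns(tagged_tokens, min_len=2):
    """Return runs of at least min_len adjacent noun-tagged tokens."""
    runs = []
    # groupby splits the sequence into maximal runs of noun / non-noun tokens
    for is_noun, group in groupby(tagged_tokens, key=lambda pair: pair[1] in NOUN_TAGS):
        tokens = [token for token, _ in group]
        if is_noun and len(tokens) >= min_len:
            runs.append(tokens)
    return runs


compounds = consecutive_nouns(tagged)
print(compounds)
```

A lone noun such as "John" forms a run of length one and is dropped, which is the behaviour the question asks for.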
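Using the NLTK RegexpParser and the custom grammar rule defined in the solution below, three compound nouns ("basketball shoe", "book store", and "peanut butter") are extracted from the following sentence:
John lost his basketball shoe in the book store while eating peanut butter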
from nltk import word_tokenize, pos_tag, RegexpParser

text = "John lost his basketball shoe in the book store while eating peanut butter"
tokenized = word_tokenize(text)  # Tokenize text
tagged = pos_tag(tokenized)  # Tag tokenized text with PoS tags

# Create custom grammar rule to find consecutive occurrences of nouns
my_grammar = r"""
CONSECUTIVE_NOUNS: {<N.*><N.*>+}"""


# Function to create parse tree using custom grammar rules and PoS tagged text
def get_parse_tree(grammar, pos_tagged_text):
    cp = RegexpParser(grammar)
    parse_tree = cp.parse(pos_tagged_text)
    # parse_tree.draw()  # Visualise parse tree
    return parse_tree


# Function to get labels from custom grammar:
# takes line separated NLTK regexp grammar rules
def get_labels_from_grammar(grammar):
    labels = []
    # Skip the first line, which is empty because the grammar starts with a newline
    for line in grammar.splitlines()[1:]:
        labels.append(line.split(":")[0])
    return labels


# Function takes parse tree & list of NLTK custom grammar labels as input
# Returns phrases which match
def get_phrases_using_custom_labels(parse_tree, custom_labels_to_get):
    matching_phrases = []
    for node in parse_tree.subtrees(filter=lambda x: x.label() in custom_labels_to_get):
        # Get phrases only, drop PoS tags
        matching_phrases.append([leaf[0] for leaf in node.leaves()])
    return matching_phrases


text_parse_tree = get_parse_tree(my_grammar, tagged)
my_labels = get_labels_from_grammar(my_grammar)
phrases = get_phrases_using_custom_labels(text_parse_tree, my_labels)

for phrase in phrases:
    print(phrase)
Output:
['basketball', 'shoe']
['book', 'store']
['peanut', 'butter']
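The tag pattern <N.*> in the grammar is what lets one rule cover NN, NNS, NNP and NNPS, and <N.*><N.*>+ then requires one such tag followed by at least one more. As a rough illustration only, the sketch below uses Python's re module to check individual tags against N.* (NLTK compiles tag patterns with its own machinery, so this is an approximation of the per-tag behaviour, not its implementation). It also joins the token lists from the output into plain strings; the tag list and the names `NOUN_TAG`, `sample_tags` and `compound_strings` are assumptions for this example.

```python
import re

# Approximation of the per-tag test behind <N.*>: any tag that starts with "N"
NOUN_TAG = re.compile(r"N.*")

# A few Penn Treebank tags chosen for illustration
sample_tags = ["NN", "NNS", "NNP", "NNPS", "JJ", "VBG", "IN", "PRP$", "DT"]
noun_like = [tag for tag in sample_tags if NOUN_TAG.fullmatch(tag)]
print(noun_like)

# Token lists as printed in the output, copied here by hand
phrases = [["basketball", "shoe"], ["book", "store"], ["peanut", "butter"]]

# Join each token list into a single space-separated compound
compound_strings = [" ".join(phrase) for phrase in phrases]
print(compound_strings)
```

If only common nouns are wanted, a narrower pattern such as <NN|NNS> in place of <N.*> would be one option to try.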