How to extract noun-based compound words from a sentence using Python?

Question

I am using nltk to extract nouns from a sentence with the following code:

import nltk

words = nltk.word_tokenize(sentence)
tags = nltk.pos_tag(words)

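Then I select the words tagged with the NN and NNP part-of-speech (PoS) tags. However, this only extracts single nouns such as "book" and "table" and misses noun pairs such as "basketball shoe". How can I extend the results to include such compound noun pairs?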

Answer 1

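Assuming you only want to find noun-noun compounds (e.g. "book store") and not other combinations such as noun-verb (e.g. "snowfall") or adjective-noun (e.g. "hot dog"), the following solution captures two or more consecutive occurrences of the NN, NNS, NNP or NNPS part-of-speech (PoS) tags.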
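
To make the "two or more consecutive noun tags" idea concrete, here is a minimal sketch that does not use NLTK's chunk parser at all. It assumes the tokens have already been tagged; the hand-written `tagged` list, the `NOUN_TAGS` set and the `consecutive_nouns` helper are illustrative names, not part of NLTK.

```python
from itertools import groupby

# Hand-tagged tokens (an assumption standing in for pos_tag output)
tagged = [
    ("John", "NNP"), ("lost", "VBD"), ("his", "PRP$"),
    ("basketball", "NN"), ("shoe", "NN"), ("in", "IN"),
    ("the", "DT"), ("book", "NN"), ("store", "NN"),
    ("while", "IN"), ("eating", "VBG"),
    ("peanut", "NN"), ("butter", "NN"),
]

# The four Penn Treebank noun tags named above
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def consecutive_nouns(tagged_tokens, min_length=2):
    """Return runs of at least min_length adjacent noun tokens."""
    runs = []
    # groupby splits the sequence into maximal runs of noun / non-noun tokens
    for is_noun, group in groupby(tagged_tokens, key=lambda pair: pair[1] in NOUN_TAGS):
        words = [word for word, _tag in group]
        if is_noun and len(words) >= min_length:
            runs.append(words)
    return runs


compounds = consecutive_nouns(tagged)
print(compounds)
```

A single noun such as "John" forms a run of length one and is therefore dropped, which is the same behaviour the grammar-based solution aims for.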

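Example

Using NLTK's RegexpParser and the custom grammar rule defined in the solution below, three compound nouns ("basketball shoe", "book store" and "peanut butter") are extracted from the following sentence:

John lost his basketball shoe in the book store while eating peanut butter

Solution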

# word_tokenize and pos_tag need the NLTK tokenizer and tagger data (see nltk.download)
from nltk import word_tokenize, pos_tag, RegexpParser

text = "John lost his basketball shoe in the book store while eating peanut butter"
tokenized = word_tokenize(text)  # Tokenize text
tagged = pos_tag(tokenized)  # Tag tokenized text with PoS tags

# Create custom grammar rule to find consecutive occurrences of nouns
my_grammar = r"""
CONSECUTIVE_NOUNS: {<N.*><N.*>+}"""


# Function to create parse tree using custom grammar rules and PoS tagged text
def get_parse_tree(grammar, pos_tagged_text):
    cp = RegexpParser(grammar)
    parse_tree = cp.parse(pos_tagged_text)
    # parse_tree.draw()  # Visualise parse tree
    return parse_tree


# Function to get labels from custom grammar:
# takes line separated NLTK regexp grammar rules
def get_labels_from_grammar(grammar):
    labels = []
    for line in grammar.splitlines():
        # Skip blank lines and lines that do not define a labelled rule
        if ":" in line:
            labels.append(line.split(":")[0].strip())
    return labels


# Function takes parse tree & list of NLTK custom grammar labels as input
# Returns phrases which match
def get_phrases_using_custom_labels(parse_tree, custom_labels_to_get):
    matching_phrases = []
    for node in parse_tree.subtrees(filter=lambda x: x.label() in custom_labels_to_get):
        # Get phrases only, drop PoS tags
        matching_phrases.append([leaf[0] for leaf in node.leaves()])
    return matching_phrases


text_parse_tree = get_parse_tree(my_grammar, tagged)
my_labels = get_labels_from_grammar(my_grammar)
phrases = get_phrases_using_custom_labels(text_parse_tree, my_labels)

for phrase in phrases:
    print(phrase)

Output

['basketball', 'shoe']
['book', 'store']
['peanut', 'butter']
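
If plain strings are more convenient than token lists, the matched chunks can be joined directly. The following is a compact sketch rather than the answer's own code: it assumes the tokens are already tagged (the hand-written `tagged` list stands in for `pos_tag` output, so no NLTK data download is needed) and reuses the `CONSECUTIVE_NOUNS` rule from the solution.

```python
from nltk import RegexpParser

# Hand-tagged tokens (an assumption; in practice these come from pos_tag)
tagged = [
    ("John", "NNP"), ("lost", "VBD"), ("his", "PRP$"),
    ("basketball", "NN"), ("shoe", "NN"), ("in", "IN"),
    ("the", "DT"), ("book", "NN"), ("store", "NN"),
    ("while", "IN"), ("eating", "VBG"),
    ("peanut", "NN"), ("butter", "NN"),
]

# Same rule as in the solution: two or more adjacent noun tags
chunker = RegexpParser("CONSECUTIVE_NOUNS: {<N.*><N.*>+}")
tree = chunker.parse(tagged)

# Keep only the chunks with our label and join their words into one string
compounds = [
    " ".join(word for word, _tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == "CONSECUTIVE_NOUNS")
]
print(compounds)
```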
