
How to match the keywords in a paragraph using Python (NLTK)

Keywords:

Keywords={u'secondary': [u'sales growth', u'next generation store', u'Steps Down', u' Profit warning', u'Store Of The Future', u'groceries']}

Paragraph:

paragraph="""HOUSTON -- Target has unveiled its first "next generation" store in the Houston area, part of a multibillion-dollar effort to reimagine more than 1,000 stores nationwide to compete with e-commerce giants.

The 124,000-square-foot store, which opened earlier this week at Aliana market center in Richmond, Texas, has two distinct entrances and aims to appeal to consumers on both ends of the shopping spectrum.

Busy families seeking convenience can enter the "ease" side of the store, which offers a supermarket-style experience. Customers can pick up online orders, both in store and curbside, and buy grab-and-go items like groceries, wine, last-minute gifts, cleaning supplies and prepared meals."""

Is there any way to match the keywords in the paragraph (without using regex)?

Output:

Matched keywords : next generation store , groceries

No need to use NLTK for this. First of all, you will have to clean the text in the paragraph, or change the values in the list for the 'secondary' key. '"next generation" store' and 'next generation store' are two different things.

After this you can iterate over the values of 'secondary' and check whether any of those strings exist in your text.

match = [i for i in Keywords['secondary'] if i in paragraph]
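For example, printing the result in the format shown in the question would look like this (a small sketch, assuming Keywords and paragraph are defined exactly as above):

match = [i for i in Keywords['secondary'] if i in paragraph]

# Format the matches the way the question asks for them.
print('Matched keywords : ' + ' , '.join(match))
# With the original paragraph this prints only one match (see the EDIT below):
# Matched keywords : groceries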

EDIT: As I specified above, '"next generation" store' and 'next generation store' are two different things, which is the reason you only get one match. If the paragraph contained 'next generation store', matching the keyword exactly, you would get two matches, as the second example below shows.

INPUT:

paragraph="""HOUSTON -- Target has unveiled its first "next generation" store in the Houston area, part of a multibillion-dollar effort to reimagine more than 1,000 stores nationwide to compete with e-commerce giants.

The 124,000-square-foot store, which opened earlier this week at Aliana market center in Richmond, Texas, has two distinct entrances and aims to appeal to consumers on both ends of the shopping spectrum.

Busy families seeking convenience can enter the "ease" side of the store, which offers a supermarket-style experience. Customers can pick up online orders, both in store and curbside, and buy grab-and-go items like groceries, wine, last-minute gifts, cleaning supplies and prepared meals."""

OUTPUT:

['groceries']

INPUT:

paragraph="""HOUSTON -- Target has unveiled its first next generation store in the Houston area, part of a multibillion-dollar effort to reimagine more than 1,000 stores nationwide to compete with e-commerce giants.

The 124,000-square-foot store, which opened earlier this week at Aliana market center in Richmond, Texas, has two distinct entrances and aims to appeal to consumers on both ends of the shopping spectrum.

Busy families seeking convenience can enter the "ease" side of the store, which offers a supermarket-style experience. Customers can pick up online orders, both in store and curbside, and buy grab-and-go items like groceries, wine, last-minute gifts, cleaning supplies and prepared meals."""

OUTPUT:

['next generation store', 'groceries']
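If you want the first input (with the quotes around "next generation") to also yield both matches, you can clean the punctuation out of the paragraph before matching, as mentioned above. A minimal sketch, assuming Python 3 and plain case-sensitive substring matching:

import string

# Strip punctuation (including the quotes around "next generation") before matching.
cleaned = paragraph.translate(str.maketrans('', '', string.punctuation))
match = [kw for kw in Keywords['secondary'] if kw in cleaned]
print(match)
# ['next generation store', 'groceries']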

Firstly, you don't really need a dict if your keywords have only one key. Use a set() instead.

Keywords={u'secondary': [u'sales growth', u'next generation store', 
                         u'Steps Down', u' Profit warning', 
                         u'Store Of The Future', u'groceries']}

keywords = {u'sales growth', u'next generation store', 
            u'Steps Down', u' Profit warning', 
            u'Store Of The Future', u'groceries'}

paragraph="""HOUSTON -- Target has unveiled its first "next generation" store in the Houston area, part of a multibillion-dollar effort to reimagine more than 1,000 stores nationwide to compete with e-commerce giants.
The 124,000-square-foot store, which opened earlier this week at Aliana market center in Richmond, Texas, has two distinct entrances and aims to appeal to consumers on both ends of the shopping spectrum.
Busy families seeking convenience can enter the "ease" side of the store, which offers a supermarket-style experience. Customers can pick up online orders, both in store and curbside, and buy grab-and-go items like groceries, wine, last-minute gifts, cleaning supplies and prepared meals."""

Then, with a minor tweak from Find multi-word terms in a tokenized text in Python:

from nltk.tokenize import MWETokenizer
from nltk import word_tokenize

# Build a multi-word tokenizer from the lowercased keywords, so that e.g.
# 'next generation store' is re-joined into the single token 'next_generation_store'.
mwe = MWETokenizer([k.lower().split() for k in keywords], separator='_')

# Clean the punctuation out of your paragraph (this also removes the quotes
# around "next generation") and lowercase it.
import string
puncts = list(string.punctuation)
cleaned_paragraph = ''.join([ch if ch not in puncts else '' for ch in paragraph.lower()])

# Keep only the tokens that, with underscores turned back into spaces, are keywords.
tokenized_paragraph = [token for token in mwe.tokenize(word_tokenize(cleaned_paragraph))
                       if token.replace('_', ' ') in keywords]

print(tokenized_paragraph)

[out]:

>>> print(tokenized_paragraph)
['next_generation_store', 'groceries']
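If you need the matches in their original spelling rather than with underscores, you can map the tokens back afterwards. A small follow-up sketch, assuming Python 3.7+ so that dict.fromkeys preserves insertion order:

# Turn the underscore-joined tokens back into the original keyword strings,
# dropping duplicates while keeping their order of appearance.
matched = list(dict.fromkeys(token.replace('_', ' ') for token in tokenized_paragraph))
print('Matched keywords : ' + ' , '.join(matched))
# Matched keywords : next generation store , groceries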
