简体   繁体   中英

How to match the keywords in paragraph using python (nltk)

Keywords:

Keywords={u'secondary': [u'sales growth', u'next generation store', u'Steps Down', u' Profit warning', u'Store Of The Future', u'groceries']}

Paragraph:

paragraph="""HOUSTON -- Target has unveiled its first "next generation" store in the Houston area, part of a multibillion-dollar effort to reimagine more than 1,000 stores nationwide to compete with e-commerce giants.

The 124,000-square-foot store, which opened earlier this week at Aliana market center in Richmond, Texas, has two distinct entrances and aims to appeal to consumers on both ends of the shopping spectrum.

Busy families seeking convenience can enter the "ease" side of the store, which offers a supermarket-style experience. Customers can pick up online orders, both in store and curbside, and buy grab-and-go items like groceries, wine, last-minute gifts, cleaning supplies and prepared meals."""

it there any way to match the keywords in paragraph?(without using regex)

Output:

Matched keywords : next generation store , groceries

No need to use NLTK for this. First of all you will have to clean you text in the paragraph, or change your values in the list for the 'secondary key. '"next generation" store' and 'next generation store' are two different things.

After this you can iterate over the values of 'secondary', and check if any of those strings exist in your text.

match = [i for i in Keywords['secondary'] if i in paragraph]

EDIT: As i specified above, '"next generation" store' and 'next generation store' are two different things, which is the reason you only get 1 match. If you had 'next generation store' and 'next generation store' you would get two matches - as there are in fact two matches.

INPUT :

paragraph="""HOUSTON -- Target has unveiled its first "next generation" store in the Houston area, part of a multibillion-dollar effort to reimagine more than 1,000 stores nationwide to compete with e-commerce giants.

The 124,000-square-foot store, which opened earlier this week at Aliana market center in Richmond, Texas, has two distinct entrances and aims to appeal to consumers on both ends of the shopping spectrum.

Busy families seeking convenience can enter the "ease" side of the store, which offers a supermarket-style experience. Customers can pick up online orders, both in store and curbside, and buy grab-and-go items like groceries, wine, last-minute gifts, cleaning supplies and prepared meals."""

OUTPUT:

['groceries']

INPUT :

paragraph="""HOUSTON -- Target has unveiled its first next generation store in the Houston area, part of a multibillion-dollar effort to reimagine more than 1,000 stores nationwide to compete with e-commerce giants.

The 124,000-square-foot store, which opened earlier this week at Aliana market center in Richmond, Texas, has two distinct entrances and aims to appeal to consumers on both ends of the shopping spectrum.

Busy families seeking convenience can enter the "ease" side of the store, which offers a supermarket-style experience. Customers can pick up online orders, both in store and curbside, and buy grab-and-go items like groceries, wine, last-minute gifts, cleaning supplies and prepared meals."""

OUTPUT:

['next generation store','groceries']

Firstly, you don't really need a dict if your keywords has only one key. Use a set() instead.

Keywords={u'secondary': [u'sales growth', u'next generation store', 
                         u'Steps Down', u' Profit warning', 
                         u'Store Of The Future', u'groceries']}

keywords = {u'sales growth', u'next generation store', 
            u'Steps Down', u' Profit warning', 
            u'Store Of The Future', u'groceries'}

paragraph="""HOUSTON -- Target has unveiled its first "next generation" store in the Houston area, part of a multibillion-dollar effort to reimagine more than 1,000 stores nationwide to compete with e-commerce giants.
The 124,000-square-foot store, which opened earlier this week at Aliana market center in Richmond, Texas, has two distinct entrances and aims to appeal to consumers on both ends of the shopping spectrum.
Busy families seeking convenience can enter the "ease" side of the store, which offers a supermarket-style experience. Customers can pick up online orders, both in store and curbside, and buy grab-and-go items like groceries, wine, last-minute gifts, cleaning supplies and prepared meals."""

Then a minor tweak from Find multi-word terms in a tokenized text in Python

from nltk.tokenize import MWETokenizer
from nltk import sent_tokenize, word_tokenize

mwe = MWETokenizer([k.lower().split() for k in keywords], separator='_')

# Clean out the punctuations in your sentence.
import string
puncts = list(string.punctuation)
cleaned_paragraph = ''.join([ch if ch not in puncts else '' for ch in paragraph.lower()])

tokenized_paragraph = [token for token in mwe.tokenize(word_tokenize(cleaned_paragraph))
                       if token.replace('_', ' ') in keywords]

print(tokenized_paragraph)

[out]:

>>> print(tokenized_paragraph)
['next_generation_store', 'groceries'

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM