Extract topic keywords from text

Question

I'm trying to extract a list of ingredients from a cooking recipe. To do that, I put a long list of known ingredients in a file and check each one against the recipe. The code looks like this:

ingredients = ['sugar', 'flour', 'apple']
found = []
recipe = '''
1 teaspoon of sugar
2 tablespoons of flour.
3 apples
'''
for ingredient in ingredients:
    if ingredient in recipe:
        found.append(ingredient)

I'm looking for a more efficient way to do that because the list of possible ingredients can be really big. Any ideas?

Answer 1

You could split your input and use sets:

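ingredients = {'sugar', 'flour', 'apple'}
# split() with no argument splits on any whitespace, including newlines;
# strip('.,') removes trailing punctuation such as the '.' in 'flour.'
recipe_elements = {word.strip('.,') for word in recipe.split()}
used_ingredients = ingredients & recipe_elements    # the intersection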

You may need to clean up your input in various ways, depending on where it comes from. You'd also need to benchmark to see whether this is actually any faster. And without extra work (making everything singular, for example), it won't match 'apple' where the recipe says 'apples', as in your example.
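As a rough sketch of the "extra work" mentioned above, the tokens can be normalized before taking the intersection. The helper names `normalize` and `find_ingredients` are made up for illustration, and the plural rule is deliberately naive (it just drops a trailing 's'); a real stemmer or lemmatizer would be more robust.

```python
import re

def normalize(word):
    """Lowercase a token and drop a trailing plural 's' (a naive rule)."""
    word = word.lower()
    # Naive singularization: 'apples' -> 'apple'. Words such as 'berries'
    # or 'molasses' would need a proper stemmer or lemmatizer instead.
    if word.endswith('s') and not word.endswith('ss') and len(word) > 3:
        return word[:-1]
    return word

def find_ingredients(recipe, ingredients):
    # Keep alphabetic runs only, so 'flour.' becomes 'flour' and digits vanish.
    tokens = {normalize(w) for w in re.findall(r"[A-Za-z]+", recipe)}
    return ingredients & tokens

ingredients = {'sugar', 'flour', 'apple'}
recipe = '''
1 teaspoon of sugar
2 tablespoons of flour.
3 apples
'''
found = find_ingredients(recipe, ingredients)
```

With this normalization, each recipe word costs one set lookup on average, so the work grows with the length of the recipe rather than with the size of the ingredient list. Multi-word ingredients (e.g. 'brown sugar') would still need separate handling.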

Answer 2

You could try part-of-speech (POS) tagging with nltk, keeping the nouns and then using a custom stoplist to exclude nouns that refer to quantities, such as teaspoon, handful, etc. That would give you a much smaller list to build and maintain manually, and also a shorter list to check against, like this:

ingredients = set(nouns) - set(stopwords)  # take the difference

In terms of making the actual check for ingredients in your recipe more efficient, you would be better off taking the intersection of words in your recipe (probably not worth doing POS tagging here) and the ingredients list as @jbrown suggests.
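The two steps described above can be sketched together as follows. No tagger is actually run here: the `nouns` list is a hand-written stand-in for what a POS tagger might return, and the `stopwords` list and sample recipe are likewise assumptions for illustration.

```python
# Stand-in for the nouns a POS tagger might return from a corpus of recipes
# (hypothetical data; no tagger is run here).
nouns = ['teaspoon', 'sugar', 'tablespoon', 'flour', 'apple', 'handful']
# Custom stoplist of quantity words to exclude (also hypothetical).
stopwords = ['teaspoon', 'tablespoon', 'handful', 'cup']

# Step 1: build the ingredient vocabulary once, as a set difference.
ingredients = set(nouns) - set(stopwords)

# Step 2: check a recipe with a plain set intersection, no tagging needed.
recipe = 'Mix 1 teaspoon of sugar with 2 tablespoons of flour'
recipe_words = set(recipe.lower().split())
found = ingredients & recipe_words
```

The tagging cost is paid only when building the vocabulary; each recipe check afterwards is just a word split and an intersection.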

solution1
2 ACCEPTED 2016-01-07 16:19:44

solution2
1 2016-01-07 16:15:04
