Suppose I have a list of keywords and a list of sentences:
keywords = ['foo', 'bar', 'joe', 'mauer']
listOfStrings = ['I am frustrated', 'this task is foobar', 'mauer is awesome']
How can I loop through listOfStrings and determine whether each string contains any of the keywords? It must be an exact, whole-word match, such that:
>>> for i in listOfStrings:
...     for p in keywords:
...         if p in i:
...             print i
...
'mauer is awesome'
(Because 'foobar' is NOT an exact match for 'foo' or 'bar', the function should only catch 'foobar' if 'foobar' itself is a keyword.)
I suspect re.search may be the way, but I can't figure out how to loop through the list using variables, rather than verbatim expressions, with the re module.
Thanks
A much better idea for exact matches is to store the keywords in a set:
keywords = {'foo', 'bar', 'joe', 'mauer'}
listOfStrings = ['I am frustrated', 'this task is foobar', 'mauer is awesome']
[s for s in listOfStrings if any(w in keywords for w in s.split())]
This only tests each word in listOfStrings once. Your method (or using regex) looks at every word in listOfStrings for each keyword. As the number of keywords grows, that will be very inefficient.
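One caveat with s.split() is that it only splits on whitespace, so a word with attached punctuation (such as 'mauer,') would not match. The following is a minimal sketch of the same set-based idea that also trims surrounding punctuation; the function name sentences_with_keywords is hypothetical and not part of the original answer.

```python
import string

def sentences_with_keywords(sentences, keywords):
    """Return the sentences that contain at least one whole-word keyword."""
    keyword_set = set(keywords)  # average O(1) membership test
    matches = []
    for sentence in sentences:
        # Split on whitespace, then trim surrounding punctuation so that
        # 'mauer,' or 'joe!' still counts as the word 'mauer' or 'joe'.
        words = (w.strip(string.punctuation) for w in sentence.split())
        if any(w in keyword_set for w in words):
            matches.append(sentence)
    return matches

keywords = ['foo', 'bar', 'joe', 'mauer']
listOfStrings = ['I am frustrated', 'this task is foobar', 'mauer is awesome']
result = sentences_with_keywords(listOfStrings, keywords)
# result == ['mauer is awesome']
```

The comparison is still case-sensitive; lowercasing both the keywords and the words before the lookup would be one way to relax that.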
If you surround a word with the regex metacharacter \b and then use it as a regex, it is required to match on word boundaries: http://www.regular-expressions.info/wordboundaries.html
The metacharacter \b is an anchor like the caret and the dollar sign. It matches at a position that is called a "word boundary". This match is zero-length.
In addition, if the match should be case-insensitive, make sure that your Python regex uses re.IGNORECASE: http://docs.python.org/2/library/re.html#re.IGNORECASE
And don't forget that \ may be treated as a metacharacter both by the language's string parser AND by the regex engine itself, meaning it has to be doubled up into \\b in an ordinary string literal (or written in a raw string, r'\b').
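To build such a pattern from a variable rather than a verbatim expression, one option is to join the keywords into a single alternation. This is a sketch under the assumption that every keyword starts and ends with a letter, digit, or underscore (otherwise \b will not anchor where you expect); the names pattern and matching are illustrative.

```python
import re

keywords = ['foo', 'bar', 'joe', 'mauer']
listOfStrings = ['I am frustrated', 'this task is foobar', 'mauer is awesome']

# re.escape() guards against keywords that contain regex metacharacters.
# The raw-string prefix r'' keeps \b as a regex word boundary rather than
# a backspace character, so the backslash does not need to be doubled.
pattern = re.compile(
    r'\b(?:' + '|'.join(re.escape(k) for k in keywords) + r')\b',
    re.IGNORECASE,
)

# Keep only the strings in which some keyword appears as a whole word.
matching = [s for s in listOfStrings if pattern.search(s)]
# matching == ['mauer is awesome']
```

Because the alternation is compiled once, each string is scanned a single time regardless of how many keywords there are.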
Instead of checking whether each keyword is contained anywhere in the string, you can break the sentences down into words and check whether each word is a keyword. Then you won't have problems with partial matches.
Here, RE_WORD is defined as the regular expression for a word boundary, one or more ASCII letters, and then another word boundary. You can use re.findall() to find all words in the string. re.compile() pre-compiles the regular expression so that it doesn't have to be parsed from scratch for every line.
frozenset() is an efficient data structure that can answer the question “is the given word in the frozen set?” faster than is possible by scanning through a long list of keywords and trying every one of them.
#!/usr/bin/env python2.7
import re

# A word: a run of ASCII letters delimited by word boundaries.
RE_WORD = re.compile(r'\b[a-zA-Z]+\b')
keywords = frozenset(['foo', 'bar', 'joe', 'mauer'])
listOfStrings = ['I am frustrated', 'this task is foobar', 'mauer is awesome']

for i in listOfStrings:
    for word in RE_WORD.findall(i):
        if word in keywords:
            print i
            break  # one match is enough; stop so the string is printed only once