Search through a list of strings and determine whether any contains an exact match from a separate list of strings (Python, sentiment analysis)

Suppose I have a list of keywords and a list of sentences:

keywords = ['foo', 'bar', 'joe', 'mauer']
listOfStrings = ['I am frustrated', 'this task is foobar', 'mauer is awesome']

How can I loop through listOfStrings and determine whether each string contains any of the keywords? It must be an exact match, such that:

for i in listOfStrings:
    for p in keywords:
        if p in i:
            print(i)

Desired output:

mauer is awesome

(Because 'foobar' is not an exact match for 'foo' or 'bar', the function should only catch 'foobar' if 'foobar' itself is a keyword.)

I suspect re.search may be the way, but I can't figure out how to loop through the list using variables rather than literal expressions with the re module.
Thanks

A much better idea for exact matches is to store the keywords in a set:

keywords = {'foo', 'bar', 'joe', 'mauer'}
listOfStrings = ['I am frustrated', 'this task is foobar', 'mauer is awesome']

[s for s in listOfStrings if any(w in keywords for w in s.split())]

This tests each word in listOfStrings only once, because a set membership check takes constant time on average. Your method (or a regex) checks every keyword against every string. As the number of keywords grows, that becomes very inefficient.
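One caveat: s.split() splits on whitespace only, so a word with punctuation attached (for example 'mauer!') would not be found in the set. Here is a minimal sketch of one way to handle that, assuming it is acceptable to strip ASCII punctuation from each token and to compare case-insensitively. Neither assumption comes from the question, and the helper name sentences_with_keywords is made up for illustration.

```python
import string


def sentences_with_keywords(sentences, keywords):
    """Return the sentences that contain at least one keyword as a whole word."""
    # Assumption: matching should ignore case, so normalise the keywords once.
    wanted = {k.lower() for k in keywords}
    matches = []
    for s in sentences:
        # Split on whitespace, then strip leading/trailing punctuation per token.
        words = (w.strip(string.punctuation).lower() for w in s.split())
        if any(w in wanted for w in words):
            matches.append(s)
    return matches


keywords = {'foo', 'bar', 'joe', 'mauer'}
listOfStrings = ['I am frustrated', 'this task is foobar', 'Mauer is awesome!']

print(sentences_with_keywords(listOfStrings, keywords))
```

Each sentence is still tokenised only once, and every lookup is a set membership test, so the cost does not grow with the number of keywords.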

If you surround a word with the regex metacharacter \b and then use it as a regex, it is required to match on word boundaries:

http://www.regular-expressions.info/wordboundaries.html

The metacharacter \b is an anchor like the caret and the dollar sign. It matches at a position that is called a "word boundary". This match is zero-length.

In addition, if the match should be case-insensitive, make sure that your Python regex uses re.IGNORECASE: http://docs.python.org/2/library/re.html#re.IGNORECASE

And don't forget that \ may be considered a metacharacter both in the language's string parser AND in the regex engine itself, meaning it will have to be doubled up into \\b. In Python, writing the pattern as a raw string (for example r'\bfoo\b') avoids the doubling.
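To apply this to keywords held in a variable rather than typed out verbatim, one option is to join them into a single alternation and wrap it in word boundaries. The following is a sketch, not the only way to do it. The name KEYWORD_RE is illustrative, and it assumes the keywords begin and end with word characters (letters, digits or underscore), since \b only makes sense next to those.

```python
import re

keywords = ['foo', 'bar', 'joe', 'mauer']
listOfStrings = ['I am frustrated', 'this task is foobar', 'mauer is awesome']

# Build the pattern \b(?:foo|bar|joe|mauer)\b from the list.
# re.escape protects any regex metacharacters a keyword might contain.
pattern = r'\b(?:' + '|'.join(re.escape(k) for k in keywords) + r')\b'
KEYWORD_RE = re.compile(pattern, re.IGNORECASE)

# Keep only the sentences in which some keyword appears as a whole word.
matches = [s for s in listOfStrings if KEYWORD_RE.search(s)]
print(matches)
```

Compiling one combined pattern means each sentence is scanned once, instead of running a separate re.search per keyword.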

Instead of checking if each keyword is contained anywhere in the string, you can break the sentences down into words, and check whether each of them is a keyword. Then you won't have problems with partial matches.

In the code below, RE_WORD is defined as the regular expression for a word boundary, one or more ASCII letters, and then another word boundary. You can use re.findall() to find all words in the string. re.compile() pre-compiles the regular expression so that it doesn't have to be parsed from scratch for every line.

frozenset() is an efficient data structure that can answer the question “is the given word in the frozen set?” faster than is possible by scanning through a long list of keywords and trying every one of them.

#!/usr/bin/env python3

import re

# A word: one or more ASCII letters delimited by word boundaries.
RE_WORD = re.compile(r'\b[a-zA-Z]+\b')

keywords = frozenset(['foo', 'bar', 'joe', 'mauer'])
listOfStrings = ['I am frustrated', 'this task is foobar', 'mauer is awesome']

for i in listOfStrings:
    for word in RE_WORD.findall(i):
        if word in keywords:
            print(i)
            # Stop after the first hit so each sentence is printed at most once.
            break
