
Search through a list of strings and determine whether there is an exact match in a separate list of strings (Python, sentiment analysis)

Suppose I have a list of keywords and a list of sentences:

keywords = ['foo', 'bar', 'joe', 'mauer']
listOfStrings = ['I am frustrated', 'this task is foobar', 'mauer is awesome']

How can I loop through listOfStrings and determine whether each string contains any of the keywords? It must be an exact match, such that:

for i in listOfStrings:
    for p in keywords:
        if p in i:
            print(i)

Desired output: 'mauer is awesome'

(Because 'foobar' is NOT an exact match for 'foo' or 'bar', the function should only catch 'foobar' if 'foobar' itself is a keyword.)

I suspect re.search may be the way, but I can't figure out how to loop through the list using variables, rather than verbatim expressions, with the re module.
Thanks

A much better idea for exact matches is to store the keywords in a set:

keywords = {'foo', 'bar', 'joe', 'mauer'}
listOfStrings = ['I am frustrated', 'this task is foobar', 'mauer is awesome']

[s for s in listOfStrings if any(w in keywords for w in s.split())]

This tests each word in listOfStrings only once, and each test is a set lookup that takes constant time on average. Your method (or using regex) looks at every word in listOfStrings for each keyword. As the number of keywords grows, that will be very inefficient.
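To make the difference concrete, here is a minimal sketch that runs both approaches on the data from the question. The function names substring_matches and whole_word_matches are made up for this illustration, and it is written for Python 3.

```python
keywords = {'foo', 'bar', 'joe', 'mauer'}
listOfStrings = ['I am frustrated', 'this task is foobar', 'mauer is awesome']


def substring_matches(sentences, keywords):
    """Approach from the question: a keyword anywhere inside the sentence counts."""
    return [s for s in sentences if any(k in s for k in keywords)]


def whole_word_matches(sentences, keywords):
    """Set approach: a whitespace-separated word must equal a keyword exactly."""
    return [s for s in sentences if any(w in keywords for w in s.split())]


print(substring_matches(listOfStrings, keywords))
# ['this task is foobar', 'mauer is awesome']
print(whole_word_matches(listOfStrings, keywords))
# ['mauer is awesome']
```

Note that s.split() only splits on whitespace, so a word with attached punctuation (for example 'mauer,') would not be found in the set.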

If you surround a word with the regex metacharacter \b and then use it as a regex, it is required to match on word boundaries:

http://www.regular-expressions.info/wordboundaries.html

The metacharacter \b is an anchor like the caret and the dollar sign. It matches at a position that is called a "word boundary". This match is zero-length.

In addition, make sure that your Python regex uses re.IGNORECASE so that keywords match regardless of case: http://docs.python.org/2/library/re.html#re.IGNORECASE

And don't forget that \ may be treated as a metacharacter both by the language's string parser AND by the regex engine itself, meaning it will have to be doubled up into \\b (or, in Python, written in a raw string as r'\b').
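The question asked how to do this with variables rather than a verbatim pattern. One possible sketch, assuming Python 3, is to join the keywords into a single alternation wrapped in \b; the joining is a choice made for this example, not something the answer prescribes. The last sentence is capitalised here (unlike in the question) purely to show the effect of re.IGNORECASE.

```python
import re

keywords = ['foo', 'bar', 'joe', 'mauer']
# 'Mauer' is capitalised for this illustration only.
listOfStrings = ['I am frustrated', 'this task is foobar', 'Mauer is awesome']

# re.escape() protects keywords that happen to contain regex metacharacters.
# The raw-string prefix r'' stops Python from reading \b as a backspace.
pattern = re.compile(
    r'\b(?:' + '|'.join(re.escape(k) for k in keywords) + r')\b',
    re.IGNORECASE,
)

matches = [s for s in listOfStrings if pattern.search(s)]
print(matches)
# ['Mauer is awesome']
```

'foobar' is not matched because there is no word boundary between 'foo' and 'bar'.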

Instead of checking if each keyword is contained anywhere in the string, you can break the sentences down into words, and check whether each of them is a keyword. Then you won't have problems with partial matches.

Here, RE_WORD is defined as the regular expression of a word boundary, one or more ASCII letters, and then another word boundary. You can use re.findall() to find all words in the string. re.compile() pre-compiles the regular expression so that it doesn't have to be parsed from scratch for every line.

frozenset() is an efficient data structure that can answer the question "is the given word in the frozen set?" faster than is possible by scanning through a long list of keywords and trying every one of them.

#!/usr/bin/env python2.7

import re

RE_WORD = re.compile(r'\b[a-zA-Z]+\b')

keywords = frozenset(['foo', 'bar', 'joe', 'mauer'])
listOfStrings = ['I am frustrated', 'this task is foobar', 'mauer is awesome']

for i in listOfStrings:
    for word in RE_WORD.findall(i):
        if word in keywords:
            print(i)
            break  # stop after the first keyword so each sentence prints once
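As a variation on the loop above, the same check can be wrapped in a function that returns the matching sentences. This is only a sketch for Python 3: matching_sentences and the sample list (with punctuation added) are invented for the example. It relies on frozenset.isdisjoint(), which accepts any iterable.

```python
import re

RE_WORD = re.compile(r'\b[a-zA-Z]+\b')
keywords = frozenset(['foo', 'bar', 'joe', 'mauer'])


def matching_sentences(sentences):
    """Return each sentence that contains at least one keyword as a whole word."""
    return [s for s in sentences
            if not keywords.isdisjoint(RE_WORD.findall(s))]


# Hypothetical input with punctuation attached to the keyword.
sample = ['I am frustrated', 'this task is foobar', 'mauer, is awesome!']
print(matching_sentences(sample))
# ['mauer, is awesome!']
```

Because RE_WORD extracts only runs of letters, 'mauer,' is still recognised as 'mauer', which a plain whitespace split would miss.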
