
Python: how to determine if a list of words exist in a string

Given a list ["one", "two", "three"], how can I determine whether each word exists in a specified string?

The word list is pretty short (in my case fewer than 20 words), but there are a lot of strings to search (400,000 strings for each run).

My current implementation uses re to look for matches, but I'm not sure if it's the best way.

import re
word_list = ["one", "two", "three"]
regex_string = r"(?<=\W)(%s)(?=\W)" % "|".join(word_list)

finder = re.compile(regex_string)
string_to_be_searched = "one two three"

results = finder.findall(" %s " % string_to_be_searched)
result_set = set(results)
for word in word_list:
    if word in result_set:
        print("%s in string" % word)

Problems in my solution:

  1. It will search until the end of the string, although the words may appear in the first half of the string.
  2. In order to overcome the limitation of the lookbehind/lookahead assertions (I don't know how to express "the character before the current match should be a non-word character, or the start of the string"), I added an extra space before and after the string to be searched.
  3. Are there other performance issues introduced by the lookbehind/lookahead assertions?
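
For problem 2, one way to express "preceded by a non-word character or the start of the string" is a negative lookbehind, `(?<!\w)`, paired with a negative lookahead, `(?!\w)`. Both succeed at the edges of the string, so the padding spaces become unnecessary. This is a minimal sketch, not the asker's code; the helper name `find_words` is made up for illustration, and `re.escape` is added as a precaution in case a word contains regex metacharacters.

```python
import re

word_list = ["one", "two", "three"]

# (?<!\w) succeeds at the start of the string or after a non-word character;
# (?!\w) succeeds at the end of the string or before a non-word character.
pattern = re.compile(r"(?<!\w)(?:%s)(?!\w)" % "|".join(map(re.escape, word_list)))

def find_words(text):
    """Return the set of listed words that occur as whole words in text."""
    return set(pattern.findall(text))

print(find_words("one two threesome"))
```

Here "three" is not reported, because the "s" that follows it makes the lookahead fail.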

Possible simpler implementation:

  1. Just loop through the word list and do an if word in string_to_be_searched. But it cannot deal with "threesome" if you are looking for "three".
  2. Use one regular expression search per word. Still, I'm not sure about the performance, or about the cost of potentially searching the string multiple times.

UPDATE:

I've accepted Aaron Hall's answer https://stackoverflow.com/a/21718896/683321 because, according to Peter Gibson's benchmark https://stackoverflow.com/a/21742190/683321, this simple version has the best performance. If you are interested in this problem, you can read all the answers and get a better view.

Actually, I forgot to mention another constraint in my original problem. The word can be a phrase, for example: word_list = ["one day", "second day"]. Maybe I should ask another question.
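
Phrases break any approach that splits the string on whitespace, because "one day" never appears as a single token. A regex alternation still works. The sketch below is one hedged option, not something taken from the answers, and the function name `phrases_in_string` is hypothetical. It sorts longer phrases first so that a phrase is preferred over a shorter entry that is its prefix.

```python
import re

def phrases_in_string(phrase_list, a_string):
    """Return the set of phrases that occur on word boundaries in a_string."""
    # Longest first, so "one day" is tried before "one" at the same position.
    ordered = sorted(phrase_list, key=len, reverse=True)
    pattern = r"(?<!\w)(?:%s)(?!\w)" % "|".join(re.escape(p) for p in ordered)
    return set(re.findall(pattern, a_string))

print(phrases_in_string(["one day", "second day"], "on the second day it rained"))
```

Because `re.findall` returns non-overlapping matches, a list containing both "one" and "one day" would report only the longer entry at a given position.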

This function was found by Peter Gibson (below) to be the most performant of the answers here. It is good for datasets one may hold in memory (because it creates a list of words from the string to be searched and then a set of those words):

def words_in_string(word_list, a_string):
    return set(word_list).intersection(a_string.split())

Usage:

my_word_list = ['one', 'two', 'three']
a_string = 'one two three'
if words_in_string(my_word_list, a_string):
    print('One or more words found!')

Which prints One or more words found! to stdout.

It does return the actual words found:

for word in words_in_string(my_word_list, a_string):
    print(word)

Prints out:

three
two
one

For data so large you can't hold it in memory, the solution given in this answer would be very performant.

To satisfy my own curiosity, I've timed the posted solutions. Here are the results:

TESTING: words_in_str_peter_gibson          0.207071995735
TESTING: words_in_str_devnull               0.55300579071
TESTING: words_in_str_perreal               0.159866499901
TESTING: words_in_str_mie                   Test #1 invalid result: None
TESTING: words_in_str_adsmith               0.11831510067
TESTING: words_in_str_gnibbler              0.175446796417
TESTING: words_in_string_aaron_hall         0.0834425926208
TESTING: words_in_string_aaron_hall2        0.0266295194626
TESTING: words_in_str_john_pirie            <does not complete>

Interestingly, @AaronHall's solution

def words_in_string(word_list, a_string):
    return set(word_list).intersection(a_string.split())

which is the fastest, is also one of the shortest! Note it doesn't handle punctuation next to words, but it's not clear from the question whether that is a requirement. This solution was also suggested by @MIE and @user3.
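
If punctuation next to words does matter, one possible adjustment (an assumption, not one of the benchmarked functions) is to tokenize with `re.findall(r"\w+", ...)` instead of `str.split()`. The name `words_in_string_punct` is made up for this sketch.

```python
import re

def words_in_string_punct(word_list, a_string):
    """Set intersection as above, but tokens are runs of word characters."""
    return set(word_list).intersection(re.findall(r"\w+", a_string))

print(sorted(words_in_string_punct(["one", "two", "three"], "one, two. threesome!")))
# -> ['one', 'two']
```

The regex tokenizer is likely slower than `str.split()`, so this trades some of the speed advantage for correctness around punctuation.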

I didn't look very long at why two of the solutions did not work. Apologies if this is my mistake. Here is the code for the tests; comments and corrections are welcome:

from __future__ import print_function
import re
import string
import random
words = ['one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten']

def random_words(length):
    letters = ''.join(set(string.ascii_lowercase) - set(''.join(words))) + ' '
    return ''.join(random.choice(letters) for i in range(int(length)))

LENGTH = 400000
RANDOM_STR = random_words(LENGTH/100) * 100
TESTS = (
    (RANDOM_STR + ' one two three', (
        ['one', 'two', 'three'],
        set(['one', 'two', 'three']),
        False,
        [True] * 3 + [False] * 7,
        {'one': True, 'two': True, 'three': True, 'four': False, 'five': False, 'six': False,
            'seven': False, 'eight': False, 'nine': False, 'ten':False}
        )),

    (RANDOM_STR + ' one two three four five six seven eight nine ten', (
        ['one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten'],
        set(['one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten']),
        True,
        [True] * 10,
        {'one': True, 'two': True, 'three': True, 'four': True, 'five': True, 'six': True,
            'seven': True, 'eight': True, 'nine': True, 'ten':True}
        )),

    ('one two three ' + RANDOM_STR, (
        ['one', 'two', 'three'],
        set(['one', 'two', 'three']),
        False,
        [True] * 3 + [False] * 7,
        {'one': True, 'two': True, 'three': True, 'four': False, 'five': False, 'six': False,
            'seven': False, 'eight': False, 'nine': False, 'ten':False}
        )),

    (RANDOM_STR, (
        [],
        set(),
        False,
        [False] * 10,
        {'one': False, 'two': False, 'three': False, 'four': False, 'five': False, 'six': False,
            'seven': False, 'eight': False, 'nine': False, 'ten':False}
        )),

    (RANDOM_STR + ' one two three ' + RANDOM_STR, (
        ['one', 'two', 'three'],
        set(['one', 'two', 'three']),
        False,
        [True] * 3 + [False] * 7,
        {'one': True, 'two': True, 'three': True, 'four': False, 'five': False, 'six': False,
            'seven': False, 'eight': False, 'nine': False, 'ten':False}
        )),

    ('one ' + RANDOM_STR + ' two ' + RANDOM_STR + ' three', (
        ['one', 'two', 'three'],
        set(['one', 'two', 'three']),
        False,
        [True] * 3 + [False] * 7,
        {'one': True, 'two': True, 'three': True, 'four': False, 'five': False, 'six': False,
            'seven': False, 'eight': False, 'nine': False, 'ten':False}
        )),

    ('one ' + RANDOM_STR + ' two ' + RANDOM_STR + ' threesome', (
        ['one', 'two'],
        set(['one', 'two']),
        False,
        [True] * 2 + [False] * 8,
        {'one': True, 'two': True, 'three': False, 'four': False, 'five': False, 'six': False,
            'seven': False, 'eight': False, 'nine': False, 'ten':False}
        )),

    )

def words_in_str_peter_gibson(words, s):
    words = words[:]
    found = []
    for match in re.finditer(r'\w+', s):
        word = match.group()
        if word in words:
            found.append(word)
            words.remove(word)
            if len(words) == 0: break
    return found

def words_in_str_devnull(word_list, inp_str1):
    return dict((word, bool(re.search(r'\b{}\b'.format(re.escape(word)), inp_str1))) for word in word_list)


def words_in_str_perreal(wl, s):
    i, swl, strwords = 0, sorted(wl), sorted(s.split())
    for w in swl:
        while strwords[i] < w:  
            i += 1
            if i >= len(strwords): return False
        if w != strwords[i]: return False
    return True

def words_in_str_mie(search_list, string):
    lower_string=string.lower()
    if ' ' in lower_string:
        result=filter(lambda x:' '+x.lower()+' ' in lower_string,search_list)
        substr=lower_string[:lower_string.find(' ')]
        if substr in search_list and substr not in result:
            result+=substr
        substr=lower_string[lower_string.rfind(' ')+1:]
        if substr in search_list and substr not in result:
            result+=substr
    else:
        if lower_string in search_list:
            result=[lower_string]

def words_in_str_john_pirie(word_list, to_be_searched):
    for word in word_list:
        found = False
        while not found:
            offset = 0
            # Regex is expensive; use find
            index = to_be_searched.find(word, offset)
            if index < 0:
                # Not found
                break
            if index > 0 and to_be_searched[index - 1] != " ":
                # Found, but substring of a larger word; search rest of string beyond
                offset = index + len(word)
                continue
            if index + len(word) < len(to_be_searched) \
                    and to_be_searched[index + len(word)] != " ":
                # Found, but substring of larger word; search rest of string beyond
                offset = index + len(word)
                continue
            # Found exact word match
            found = True    
    return found

def words_in_str_gnibbler(words, string_to_be_searched):
    word_set = set(words)
    found = []
    for match in re.finditer(r"\w+", string_to_be_searched):
        w = match.group()
        if w in word_set:
             word_set.remove(w)
             found.append(w)
    return found

def words_in_str_adsmith(search_list, big_long_string):
    counter = 0
    for word in big_long_string.split(" "):
        if word in search_list: counter += 1
        if counter == len(search_list): return True
    return False

def words_in_string_aaron_hall(word_list, a_string):
    def words_in_string(word_list, a_string):
        '''return iterator of words in string as they are found'''
        word_set = set(word_list)
        pattern = r'\b({0})\b'.format('|'.join(word_list))
        for found_word in re.finditer(pattern, a_string):
            word = found_word.group(0)
            if word in word_set:
                word_set.discard(word)
                yield word
                if not word_set:
                    return  # raising StopIteration here is a RuntimeError on Python 3.7+ (PEP 479)
    return list(words_in_string(word_list, a_string))

def words_in_string_aaron_hall2(word_list, a_string):
    return set(word_list).intersection(a_string.split())

ALGORITHMS = (
        words_in_str_peter_gibson,
        words_in_str_devnull,
        words_in_str_perreal,
        words_in_str_mie,
        words_in_str_adsmith,
        words_in_str_gnibbler,
        words_in_string_aaron_hall,
        words_in_string_aaron_hall2,
        words_in_str_john_pirie,
        )

def test(alg):
    for i, (s, possible_results) in enumerate(TESTS):
        result = alg(words, s)
        assert result in possible_results, \
            'Test #%d invalid result: %s ' % (i+1, repr(result))

COUNT = 10
if __name__ == '__main__':
    import timeit
    for alg in ALGORITHMS:
        print('TESTING:', alg.__name__, end='\t\t')
        try:
            print(timeit.timeit(lambda: test(alg), number=COUNT)/COUNT)
        except Exception as e:
            print(e)

Easy way:

filter(lambda x:x in string,search_list)

If you want the search to ignore case, you can do this:

lower_string=string.lower()
filter(lambda x:x.lower() in lower_string,search_list)

If you want to ignore words that are part of a bigger word, such as three in threesome:

lower_string=string.lower()
result=[]
if ' ' in lower_string:
    result=filter(lambda x:' '+x.lower()+' ' in lower_string,search_list)
    substr=lower_string[:lower_string.find(' ')]
    if substr in search_list and substr not in result:
        result+=[substr]
    substr=lower_string[lower_string.rfind(' ')+1:]
    if substr in search_list and substr not in result:
        result+=[substr]
else:
    if lower_string in search_list:
        result=[lower_string]
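
In Python 3, `filter` returns an iterator, so `result+=[substr]` above would fail there. A shorter variant of the same idea, sketched under the assumption that words are separated by single spaces only, pads the string with a space on each side so the first and last words need no special handling. The name `whole_words_found` is hypothetical.

```python
def whole_words_found(search_list, string):
    """Case-insensitive whole-word lookup for space-separated text."""
    padded = ' ' + string.lower() + ' '
    return [w for w in search_list if ' ' + w.lower() + ' ' in padded]

print(whole_words_found(['One', 'two', 'three'], 'one Two threesome'))
# -> ['One', 'two']
```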


If performance is needed:

arr=string.split(' ')
result=list(set(arr).intersection(set(search_list)))

EDIT: this method was the fastest in an example that searches for 1,000 words in a string containing 400,000 words, but if we increase the string to 4,000,000 words the previous method is faster.


If the string is too long, you should do a low-level search and avoid converting it to a list:

def safe_remove(arr,elem):
    try:
        arr.remove(elem)
    except:
        pass

not_found=search_list[:]
i=string.find(' ')
j=string.find(' ',i+1)
safe_remove(not_found,string[:i])
while j!=-1:
    safe_remove(not_found,string[i+1:j])
    i,j=j,string.find(' ',j+1)
safe_remove(not_found,string[i+1:])

The not_found list contains the words that were not found. You can get the found list easily; one way is list(set(search_list)-set(not_found)).

EDIT: the last method appears to be the slowest.

def words_in_str(s, wl):
    i, swl, strwords = 0, sorted(wl), sorted(s.split())
    for w in swl:
        while strwords[i] < w:  
            i += 1
            if i >= len(strwords): return False
        if w != strwords[i]: return False
    return True

You can try this:

list(set(s.split()).intersection(set(w)))

It returns only the matched words from your word list. If no words match, it returns an empty list.

If your string is long and your search list is short, do this:

def search_string(big_long_string,search_list):
    counter = 0
    for word in big_long_string.split(" "):
        if word in search_list: counter += 1
        if counter == len(search_list): return True
    return False
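
Note that the counter above also increases when the same word occurs more than once, so a string such as "one one one" could be reported as containing all three words. A hedged variant that tracks the words still missing in a set avoids this; `all_words_present` is a made-up name.

```python
def all_words_present(big_long_string, search_list):
    """Return True once every word in search_list has been seen."""
    missing = set(search_list)
    for word in big_long_string.split(" "):
        missing.discard(word)      # no-op if word is not being searched for
        if not missing:
            return True            # stop early: everything was found
    return False

print(all_words_present("one one one", ["one", "two", "three"]))    # -> False
print(all_words_present("three two one", ["one", "two", "three"]))  # -> True
```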

You could make use of word boundaries:

>>> import re
>>> word_list = ["one", "two", "three"]
>>> inp_str = "This line not only contains one and two, but also three"
>>> if all(re.search(r'\b{}\b'.format(re.escape(word)), inp_str) for word in word_list):
...   print("Found all words in the list")
...
Found all words in the list
>>> inp_str = "This line not only contains one and two, but also threesome"
>>> if all(re.search(r'\b{}\b'.format(re.escape(word)), inp_str) for word in word_list):
...   print("Found all words in the list")
...
>>> inp_str = "This line not only contains one and two, but also four"
>>> if all(re.search(r'\b{}\b'.format(re.escape(word)), inp_str) for word in word_list):
...   print("Found all words in the list")
...
>>>

EDIT: As indicated in your comment, you seem to be looking for a dictionary instead:

>>> dict((word, bool(re.search(r'\b{}\b'.format(re.escape(word)), inp_str1))) for word in word_list)
{'three': True, 'two': True, 'one': True}
>>> dict((word, bool(re.search(r'\b{}\b'.format(re.escape(word)), inp_str2))) for word in word_list)
{'three': False, 'two': True, 'one': True}
>>> dict((word, bool(re.search(r'\b{}\b'.format(re.escape(word)), inp_str3))) for word in word_list)
{'three': False, 'two': True, 'one': True}

If the order isn't too important, you can use this approach:

word_set = {"one", "two", "three"}
string_to_be_searched = "one two three"

for w in string_to_be_searched.split():
    if w in word_set:
         print("%s in string" % w)
         word_set.remove(w)

The .split() creates a list, which may be a problem for your 400k word string. But if you have enough RAM, you are done.

It's of course possible to modify the for loop to avoid creating the whole list. re.finditer or a generator using str.find are the obvious choices:

import re
word_set = {"one", "two", "three"}
string_to_be_searched = "one two three"

for match in re.finditer(r"\w+", string_to_be_searched):
    w = match.group()
    if w in word_set:
         print("%s in string" % w)
         word_set.remove(w)
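
The `str.find` alternative could look like the generator below, which yields one space-separated token at a time without building a list. This is a sketch that assumes single-space separators; `iter_words` is a hypothetical helper.

```python
def iter_words(s, sep=" "):
    """Yield the pieces of s between occurrences of sep, lazily."""
    start = 0
    while True:
        end = s.find(sep, start)
        if end == -1:
            yield s[start:]        # last piece (may be empty)
            return
        yield s[start:end]
        start = end + len(sep)

word_set = {"one", "two", "three"}
found = []
for w in iter_words("zero one two threesome"):
    if w in word_set:
        found.append(w)
        word_set.remove(w)
        if not word_set:           # all words seen, stop scanning
            break

print(found)  # -> ['one', 'two']
```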

Given your comment:

I'm not actually looking for a single bool value, instead I'm looking for a dict mapping word to bool. Besides, I may need to run some test and see the performance of running re.search multiple times and run re.findall once. – yegle

I would propose the following:

import re
words = ['one', 'two', 'three']

def words_in_str(words, s):
    words = words[:]
    found = []
    for match in re.finditer(r'\w+', s):
        word = match.group()
        if word in words:
            found.append(word)
            words.remove(word)
            if len(words) == 0: break
    return found

assert words_in_str(words, 'three two one') == ['three', 'two', 'one']
assert words_in_str(words, 'one two. threesome') == ['one', 'two']
assert words_in_str(words, 'nothing of interest here one1') == []

This returns a list of words found in order, but you could easily modify it to return a dict{word:bool} as you desire.

Advantages:

  • stops searching through the input string when all words are found
  • removes a word from the candidates once it is found
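
The `dict` variant mentioned above might look like this. It is a sketch rather than part of the benchmarked code, and `words_in_str_dict` is a name invented here; it keeps the remaining candidates in a set so membership tests stay constant-time. The printed key order assumes Python 3.7+, where dicts preserve insertion order.

```python
import re

def words_in_str_dict(words, s):
    """Map each word to True/False depending on whether it occurs in s."""
    result = dict.fromkeys(words, False)
    remaining = set(words)
    for match in re.finditer(r'\w+', s):
        word = match.group()
        if word in remaining:
            result[word] = True
            remaining.remove(word)
            if not remaining:      # every word found: stop scanning
                break
    return result

print(words_in_str_dict(['one', 'two', 'three'], 'one two. threesome'))
# -> {'one': True, 'two': True, 'three': False}
```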

Here's a simple generator that would be better for big strings, or a file, as I adapt it in the section below.

Note that this should be very fast, but it will continue for as long as the string continues without hitting all the words. This came in second on Peter Gibson's benchmarking: Python: how to determine if a list of words exist in a string

For a faster solution for shorter strings, see my other answer here: Python: how to determine if a list of words exist in a string


Original Answer

import re

def words_in_string(word_list, a_string):
    '''return iterator of words in string as they are found'''
    word_set = set(word_list)
    pattern = r'\b({0})\b'.format('|'.join(word_list))
    for found_word in re.finditer(pattern, a_string):
        word = found_word.group(0)
        if word in word_set:
            word_set.discard(word)
            yield word
            if not word_set: # then we've found all words
                # end the generator with a plain return; raising StopIteration
                # here is a RuntimeError on Python 3.7+ (PEP 479)
                return

It goes through the string yielding the words as it finds them, abandoning the search after it finds all the words, or if it reaches the end of the string.

Usage:

word_list = ['word', 'foo', 'bar']
a_string = 'A very pleasant word to you.'
for word in words_in_string(word_list, a_string):
    print(word)

word

EDIT: adaptation to use with a large file:

Thanks to Peter Gibson for finding this the second fastest approach. I'm quite proud of the solution. Since the best use-case for this is to go through a huge text stream, let me adapt the above function here to handle a file. Do note that if words are broken on newlines this will not catch them, but neither would any of the other methods here.

import re

def words_in_file(word_list, a_file_path):
    '''
    return a memory friendly iterator of words as they are found
    in a file.
    '''
    word_set = set(word_list)
    pattern = r'\b({0})\b'.format('|'.join(word_list))
    with open(a_file_path, 'r') as a_file:  # the 'U' flag was removed in Python 3.11
        for line in a_file:
            for found_word in re.finditer(pattern, line):
                word = found_word.group(0)
                if word in word_set:
                    word_set.discard(word)
                    yield word
                    if not word_set: # then we've found all words
                        # end the generator, closing the file; raising
                        # StopIteration here is a RuntimeError on Python 3.7+
                        return

To demonstrate, let's write some data:

file_path = '/temp/temp/foo.txt'
with open(file_path, 'w') as f:
    f.write('this\nis\nimportant\ndata')

and usage:

word_list = ['this', 'is', 'important']
iterator = words_in_file(word_list, file_path)

We now have an iterator, and if we consume it with a list:

list(iterator)

it returns:

['this', 'is', 'important']
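
The path `/temp/temp/foo.txt` only works if that directory exists. A self-contained way to try the idea, assuming Python 3 and using `tempfile` from the standard library, is sketched below. The compact `words_in_file` here ends the generator with `return` and applies `re.escape` to each word; both are adjustments made for this sketch.

```python
import os
import re
import tempfile

def words_in_file(word_list, a_file_path):
    """Yield each listed word the first time it appears in the file."""
    word_set = set(word_list)
    pattern = re.compile(r'\b(?:{0})\b'.format('|'.join(map(re.escape, word_list))))
    with open(a_file_path) as a_file:
        for line in a_file:
            for match in pattern.finditer(line):
                word = match.group(0)
                if word in word_set:
                    word_set.discard(word)
                    yield word
                    if not word_set:
                        return     # all words found; the with block closes the file

# Write the sample data to a temporary file, search it, then clean up.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as f:
    f.write('this\nis\nimportant\ndata')
    file_path = f.name

try:
    found = list(words_in_file(['this', 'is', 'important'], file_path))
finally:
    os.remove(file_path)

print(found)  # -> ['this', 'is', 'important']
```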

