
Python - find occurrences of list of strings within string

I have a large string and a list of search strings, and I want to build a boolean list indicating whether each of the search strings exists in the large string. What is the fastest way to do this in Python?

Below is a toy example using a naive approach, but I think it's likely there's a more efficient way of doing this.

E.g. the example below should return [1, 1, 0], since both "hello" and "world" exist in the test string.

def check_strings(search_list, input):
    output = []
    for s in search_list:
        if input.find(s) > -1:
            output.append(1)
        else:
            output.append(0)
    return output

search_strings = ["hello", "world", "goodbye"]
test_string = "hello world"
print(check_strings(search_strings, test_string))

I can't say if this is the fastest (it is still O(n*m)), but this is the way I would do it:

def check_strings(search_list, input_string):
    return [s in input_string for s in search_list]

The following program might be faster, or it might not. It uses a regular expression to make one pass through the input string. Note that you may want to use re.escape(i) in the re.findall() expression, or not, depending on your needs.

def check_strings_re(search_string, input_string):
    import re
    return [any(l)
            for l in
            zip(*re.findall('|'.join('('+i+')' for i in search_string),
                            input_string))]
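
To illustrate the re.escape remark above: without escaping, a search string is read as a regex pattern rather than as literal text, so metacharacters such as "." or "(" change what it matches or make it invalid. A small sketch, where the helper names and the sample text are made up for illustration:

```python
import re

def contains_pattern(needle, haystack):
    """True if `needle`, read as a regex, matches somewhere in `haystack`."""
    return re.search(needle, haystack) is not None

def contains_literal(needle, haystack):
    """True if `needle` occurs in `haystack` as literal text."""
    return re.search(re.escape(needle), haystack) is not None

text = "version 1x0 calls foo(bar)"  # hypothetical sample input

# Unescaped, "." matches any character, so "1.0" matches "1x0".
print(contains_pattern("1.0", text))   # True (a false positive)
# Escaped, the dot is literal, so "1.0" is reported as absent.
print(contains_literal("1.0", text))   # False

# "foo(" is not a valid pattern at all unless it is escaped.
try:
    contains_pattern("foo(", text)
except re.error as exc:
    print("invalid pattern:", exc)
print(contains_literal("foo(", text))  # True
```

So if the search strings are meant as plain text, escaping is probably what you want; if they are meant as patterns, leave them unescaped.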

Here is a complete test program:

def check_strings(search_list, input_string):
    return [s in input_string for s in search_list]


def check_strings_re(search_string, input_string):
    import re
    return [any(l)
            for l in
            zip(*re.findall('|'.join('('+i+')' for i in search_string),
                            input_string))]


search_strings = ["hello", "world", "goodbye"]
test_string = "hello world"
assert check_strings(search_strings, test_string) == [True, True, False]
assert check_strings_re(search_strings, test_string) == [True, True, False]
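
Two edge cases are worth knowing about in the zip(*re.findall(...)) version: when nothing matches at all, findall returns an empty list and the function returns [] instead of one False per search string; and with a single search string, findall returns plain strings, so zip iterates over their characters. The sketch below is one possible way around both, using finditer and Match.lastindex. The name check_strings_re_safe is hypothetical, and I am assuming the search strings are literal text (hence re.escape) and contain no duplicates.

```python
import re

def check_strings_re_safe(search_strings, input_string):
    """One boolean per search string, in a single regex pass."""
    if not search_strings:
        return []
    # One capturing group per search string, joined as alternatives.
    pattern = re.compile(
        '|'.join('(' + re.escape(s) + ')' for s in search_strings))
    found = [False] * len(search_strings)
    for match in pattern.finditer(input_string):
        # Exactly one top-level group takes part in each match;
        # lastindex is its 1-based group number.
        found[match.lastindex - 1] = True
    return found

print(check_strings_re_safe(["hello", "world", "goodbye"], "hello world"))
print(check_strings_re_safe(["goodbye"], "hello world"))
print(check_strings_re_safe(["hello"], "hello world"))
```

The result always has the same length as the list of search strings, whether or not anything matched.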

An implementation using the Aho-Corasick algorithm ( https://pypi.python.org/pypi/pyahocorasick/ ), which makes a single pass through the string:

import ahocorasick
import numpy as np

def check_strings(search_list, input):
    A = ahocorasick.Automaton()
    for idx, s in enumerate(search_list):
        A.add_word(s, (idx, s))
    A.make_automaton()

    index_list = []
    for item in A.iter(input):
        index_list.append(item[1][0])

    output_list = np.array([0] * len(search_list))
    output_list[index_list] = 1
    return output_list.tolist()

search_strings = ["hello", "world", "goodbye"]
test_string = "hello world"
print(check_strings(search_strings, test_string))
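
For readers who want to see roughly what happens inside such an automaton, here is a minimal pure-Python sketch of the Aho-Corasick idea: build a trie of the search strings, add failure links with a breadth-first pass, then scan the text once. This is only an illustration written for this answer, not pyahocorasick's actual implementation, and it will be much slower than the C extension. The function name check_strings_ac is made up.

```python
from collections import deque

def check_strings_ac(search_list, text):
    """Return one boolean per search string, scanning `text` once."""
    # Trie as parallel lists: child map, failure link, matched indices.
    children = [{}]
    fail = [0]
    outputs = [[]]  # indices of search strings that end at this node

    # 1. Build the trie from the search strings.
    for idx, word in enumerate(search_list):
        node = 0
        for ch in word:
            nxt = children[node].get(ch)
            if nxt is None:
                nxt = len(children)
                children[node][ch] = nxt
                children.append({})
                fail.append(0)
                outputs.append([])
            node = nxt
        outputs[node].append(idx)

    # 2. Breadth-first pass to compute failure links
    #    (depth-1 nodes keep the default link to the root).
    queue = deque(children[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in children[node].items():
            f = fail[node]
            while f and ch not in children[f]:
                f = fail[f]
            fail[child] = children[f].get(ch, 0)
            # Inherit matches that end at the failure target.
            outputs[child].extend(outputs[fail[child]])
            queue.append(child)

    # 3. Scan the text once; an empty search string always counts as found.
    found = [word == "" for word in search_list]
    node = 0
    for ch in text:
        while node and ch not in children[node]:
            node = fail[node]
        node = children[node].get(ch, 0)
        for idx in outputs[node]:
            found[idx] = True
    return found

print(check_strings_ac(["hello", "world", "goodbye"], "hello world"))
```

Because matches are inherited along the failure links, overlapping search strings such as "ab" and "bc" in "abc" are both reported.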

I'm posting this just for comparison. My benchmark code:

#!/usr/bin/env python3
def gettext():
    from os import scandir
    l = []
    for file in scandir('.'):
        if file.name.endswith('.txt'):
            l.append(open(file.name).read())
    return ' '.join(l)

def getsearchterms():
    return list(set(open('searchterms').read().split(';')))

def rob(search_string, input_string):
    import re
    return [any(l)
            for l in
            zip(*re.findall('|'.join('('+i+')' for i in search_string),
                            input_string))]

def blotosmetek(search_strings, input_string):
    import re
    regexp = re.compile('|'.join([re.escape(x) for x in search_strings]))
    found = set(regexp.findall(input_string))
    return [x in found for x in search_strings]

def ahocorasick(search_list, input):
    import ahocorasick
    import numpy as np
    A = ahocorasick.Automaton()
    for idx, s in enumerate(search_list):
        A.add_word(s, (idx, s))
    A.make_automaton()

    index_list = []
    for item in A.iter(input):
        index_list.append(item[1][0])

    output_list = np.array([0] * len(search_list))
    output_list[index_list] = 1
    return output_list.tolist()

def naive(search_list, text):
    return [s in text for s in search_list]

def test(fn, args):
    start = datetime.now()
    ret = fn(*args)
    end = datetime.now()
    return (end-start).total_seconds()

if __name__ == '__main__':
    from datetime import datetime
    text = gettext()
    print("Got text, total of", len(text), "characters")
    search_strings = getsearchterms()
    print("Got search terms, total of", len(search_strings), "words")

    fns = [ahocorasick, blotosmetek, naive, rob]
    for fn in fns:
        r = test(fn, [search_strings, text])
        print(fn.__name__, r*1000, "ms")
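
The test helper above times a single run with datetime.now(). As a possible refinement, time.perf_counter is a clock meant for measuring short intervals, and repeating each run and keeping the best time reduces noise somewhat. A minimal sketch, where the helper name time_fn and the tiny sample data are made up for illustration:

```python
from time import perf_counter

def naive(search_list, text):
    return [s in text for s in search_list]

def time_fn(fn, args, repeats=3):
    """Return (best wall-clock seconds over `repeats` runs, last result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = perf_counter()
        result = fn(*args)
        best = min(best, perf_counter() - start)
    return best, result

# Hypothetical small inputs; substitute the real text and search terms.
seconds, result = time_fn(naive, [["hello", "goodbye"], "hello world"])
print(naive.__name__, seconds * 1000, "ms", result)
```

Returning the result alongside the timing also makes it easy to check that the different implementations agree with each other.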

I used the distinct words that appear in Leviathan as search terms and concatenated the 25 most downloaded books from Project Gutenberg as the text to search. Results are as follows:

Got text, total of 18252025 characters
Got search terms, total of 12824 words
ahocorasick 3824.111 milliseconds
Błotosmętek 360565.542 milliseconds
naive 73765.67 ms

Rob's version has already been running for about an hour and still hasn't finished. Maybe it's broken, maybe it's simply painfully slow.

My version using regular expressions:

import re

def check_strings(search_strings, input_string):
    # One alternation of the escaped search strings, matched in a single pass.
    regexp = re.compile('|'.join([re.escape(x) for x in search_strings]))
    found = set(regexp.findall(input_string))
    return [x in found for x in search_strings]

On the test data provided by the original poster it is an order of magnitude slower than Rob's pretty solution, but I'm going to do some benchmarking on a bigger sample.
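
One caveat with a single alternation passed to re.findall: it reports only non-overlapping matches, scanning left to right, so a search string that occurs only where it overlaps an earlier match can be reported as missing. A small sketch comparing it with the plain `in` check; the function names and sample strings are made up for illustration:

```python
import re

def check_strings_regex(search_strings, input_string):
    # Same approach as above: one alternation, one findall pass.
    regexp = re.compile('|'.join(re.escape(x) for x in search_strings))
    found = set(regexp.findall(input_string))
    return [x in found for x in search_strings]

def check_strings_in(search_strings, input_string):
    return [s in input_string for s in search_strings]

searches = ["ab", "bc"]
text = "abc"
# "ab" matches at position 0 and the scan resumes at "c",
# so the overlapping "bc" is never seen.
print(check_strings_regex(searches, text))  # [True, False]
print(check_strings_in(searches, text))     # [True, True]
```

Whether this matters depends on the search terms; with terms that can overlap or contain one another, the results can differ from the naive version.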
