
Fastest way to find one of several substrings in a string

I'm doing a lot of file processing where I look for one of several substrings in each line. So I have code equivalent to this:

with open(file) as infile:
    for line in infile:
        for key in MY_SUBSTRINGS:
            if key in line:
                print(key, line)

MY_SUBSTRINGS is a list of 6-20 substrings. The substrings vary in length from 10 to 30 characters and may contain spaces.

I'd really like to find a much faster way of doing this. Files have several hundred thousand lines in them, and lines are typically 150 chars. The user has to wait 30 seconds to a minute while a file processes. The above is not the only thing taking time, but it's taking quite a lot of it. I'm doing various other processing on a line-by-line basis, so it's not appropriate to search the whole file at once.

I've tried the regex and ahocorasick answers from here but they both come out slower in my tests:

Fastest way to check whether a string is a substring in a list of strings

Any suggestions for faster methods?

I'm not quite sure of the best way to share example datasets, but a logcat from an Android phone would be a good example: one that's at least 200k lines long.

Then search for 10 strings like:

(NL80211_CMD_TRIGGER_SCAN) received for

Trying to associate with

Request to deauthenticate

interface state UNINITIALIZED->ENABLED


I tried regexes like this:

import re

# note: terms containing regex metacharacters (like parentheses)
# would need re.escape here to be matched literally
match_str = "|".join(MY_SUBSTRINGS)
regex = re.compile(match_str)

with open(file) as infile:
    for line in infile:
        match = regex.search(line)
        if match:
            print(match.group(0))
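
For reference, a minimal sketch of what the ahocorasick attempt might have looked like (assuming the pyahocorasick package; the automaton is built once, then each line is scanned in a single pass regardless of how many substrings there are):

import ahocorasick  # the pyahocorasick package

# Build the automaton once, outside the per-line loop.
automaton = ahocorasick.Automaton()
for key in MY_SUBSTRINGS:
    automaton.add_word(key, key)
automaton.make_automaton()

with open(file) as infile:
    for line in infile:
        # iter() yields (end_index, value) for every occurrence in the line
        for _end, key in automaton.iter(line):
            print(key, line)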

I would build a regular expression to search through the file.

Make sure that you're not matching each of the search terms in its own loop when you use regex.

If all of your search terms are combined into one regexp, it would look something like this:

import re

line = 'fsjdk abc def abc jkl'
print(re.findall(r'(abc|def)', line))  # ['abc', 'def', 'abc']

https://docs.python.org/3/library/re.html
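
Applied concretely to the question's setup, the combined pattern would be compiled once, outside the loop. One detail worth flagging: several of the terms contain regex metacharacters (the parentheses in "(NL80211_CMD_TRIGGER_SCAN) received for"), so each term should be passed through re.escape before joining, otherwise the parentheses are silently treated as a capture group. A sketch, reusing the question's MY_SUBSTRINGS:

import re

# Escape each term so metacharacters like ( ) are matched literally,
# then compile the alternation once, outside the per-line loop.
pattern = re.compile("|".join(re.escape(s) for s in MY_SUBSTRINGS))

with open(file) as infile:
    for line in infile:
        match = pattern.search(line)
        if match:
            print(match.group(0), line)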

If you need it to run faster still, consider processing the file concurrently with threads. This is a much broader topic, but one approach is to first profile the problem and work out where the bottleneck actually is.

If the issue is that your loop is starved for disk throughput on the read, you can run through the file first, split it into chunks, and then map those chunks onto worker threads that process the data like a queue, as in the sketch below.
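
A minimal sketch of that idea with concurrent.futures, reusing the question's MY_SUBSTRINGS and file names (the chunk size of 10000 is a hypothetical starting point). One caveat worth stating plainly: because of Python's GIL, threads mostly help when the bottleneck really is I/O; if the matching itself is CPU-bound, ProcessPoolExecutor is the drop-in alternative:

import re
from concurrent.futures import ThreadPoolExecutor

pattern = re.compile("|".join(re.escape(s) for s in MY_SUBSTRINGS))

def search_chunk(lines):
    # Return (matched key, line) pairs found in one chunk of lines.
    hits = []
    for line in lines:
        match = pattern.search(line)
        if match:
            hits.append((match.group(0), line))
    return hits

def chunks(seq, size):
    # Yield fixed-size slices of a list.
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

with open(file) as infile:
    all_lines = infile.readlines()

with ThreadPoolExecutor() as pool:
    for hits in pool.map(search_chunk, chunks(all_lines, 10000)):
        for key, line in hits:
            print(key, line)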

We'd definitely need more detail on your problem to understand exactly what kind of issue you're trying to solve. And there are people here who would love to dig into a challenge.
