简体   繁体   中英

Fast method to search for a string in a big text file with python

That's my current situation:

  • I have a 2.5MB text file with around 250k strings, sorted alphabetically
  • Each string is unique
  • I don't need to modify the entries in the text file: once the text file is loaded, it is never edited
  • The text file is loaded at start and then I just need to search for strings through it

The last point is the problem. Actually I need to search for complete match AND for partial match of the string. The algorithm I wrote just involved the use of regular expressions combined with some attempts to make the process faster: for example, I hardcoded into my script the indexes of the dictionary that identified the singular letters of the alphabet, and then split the big text file fictionary into 26 smaller dictionary. That was totally useless, the script is still incredibly slow. Skimming some posts here, I was convinced to try mmap: but it looked useless to find all the partial matches, given a regular expression. Eventually I came to the conclusion that a trie may solve my problem, though i hardly know what is this. Should I go with tries? If so, how should I proceed to the creation of a trie in python? Is marisa-trie module good? Thanks to everybody

EDIT: By "partial match", I mean that I have the prefix of a string. I do not need matches at the end or in the middle, just at the beginning.

Easiest and fastest solution:

#!/usr/bin/env python

d = {}

# open your file here, i'm using /etc/hosts as an example...
f = open("/etc/hosts","r")
for line in f:
    line = line.rstrip()
    l = len(line)+1
    for i in xrange(1,l):
        d[line[:i]] = True
f.close()


while True:
    w = raw_input('> ')
    if not w:
        break

    if w in d:
        print "match found", w

Here is slightly more complex, but memory efficient one:

#!/usr/bin/env python

d = []

def binary_search(a, x, lo=0, hi=None):
    if hi is None:
        hi = len(a)
    while lo < hi:
        mid = (lo+hi)//2
        midval = a[mid]
        if midval < x:
            lo = mid+1
        elif midval > x:
            hi = mid
        else:
            return mid
    return -1


f = open("/etc/hosts","r")
for line in f:
    line=line.rstrip()
    l = len(line)+1
    for i in xrange(1,l):
        x = hash(line[:i])
        d.append(x)
f.close()

d.sort()

while True:
    w = raw_input('> ')
    if not w:
        break

    if binary_search(d, hash(w)) != -1:
        print "match found", w

Since the file is already sorted and read in, you can use a binary search on it without needing to resort to any fancy data structures. Python has a binary search function built in, bisect.bisect_left` .

Use a trie .

#dictionary is a list of words
def parse_dictionary(dictionary):
    dictionary_trie = {}
    for word in dictionary:
        tmp_trie = dictionary_trie
        for letter in word:
            if letter not in tmp_trie:
                tmp_trie[letter] = {}
            if 'words' not in tmp_trie[letter]:
                tmp_trie[letter]['words'] = []

            tmp_trie[letter]['words'].append(word)
            tmp_trie = tmp_trie[letter]
    return dictionary_trie

def matches(substring, trie):
    d = trie
    for letter in substring:
        try:
            d = d[letter]
        except KeyError:
            return []
    return d['words']

Usage example:

>>> import pprint
>>> dictionary = ['test', 'testing', 'hello', 'world', 'hai']
>>> trie = parse_dictionary(dictionary)
>>> pprint.pprint(trie)
{'h': {'a': {'i': {'words': ['hai']}, 'words': ['hai']},
       'e': {'l': {'l': {'o': {'words': ['hello']}, 'words': ['hello']},
                   'words': ['hello']},
             'words': ['hello']},
       'words': ['hello', 'hai']},
 't': {'e': {'s': {'t': {'i': {'n': {'g': {'words': ['testing']},
                                     'words': ['testing']},
                               'words': ['testing']},
                         'words': ['test', 'testing']},
                   'words': ['test', 'testing']},
             'words': ['test', 'testing']},
       'words': ['test', 'testing']},
 'w': {'o': {'r': {'l': {'d': {'words': ['world']}, 'words': ['world']},
                   'words': ['world']},
             'words': ['world']},
       'words': ['world']}}
>>> matches('h', trie)
['hello', 'hai']
>>> matches('he', trie)
['hello']
>>> matches('asd', trie)
[]
>>> matches('test', trie)
['test', 'testing']
>>> 

您可以创建一个列表,让每一行成为列表的一个元素,然后进行二进制搜索。

Using a trie still requires you to build a trie, which is O(n) to iterate the whole file-- taking advantage of the sorting will make it O(log_2 n). So, this faster solution would use a binary search (see below).

This solution still requires you to read in the entire file. In an even faster solution, you could pre-process the file and pad out all the lines so they're the same length (or build some kind of index structure in the file, to make seeking to the middle of the list feasible)-- then seeking to middle of the file would take you to the middle of the list. The "even faster" solution would probably only be needed for a really, really big file (gigabytes or hundreds of megabytes). You'd them combine this with the binary search.

Possibly, if the file-system supports sparse files -- doing the above padding scheme won't event increase the files actual blocks used on the disk.

Then, at that point, you're probably approaching a b-tree or a b+tree implementation to make the indexing efficient. So you could use a b-tree library .

Something like this:

import bisect

entries = ["a", "b", "c", "cc", "cd", "ce", "d", "e", "f" ]

def find_matches(ls, m):

    x = len(ls) / 2
    match_index = -1

    index = bisect.bisect_left(ls, m)
    matches = []

    while ls[index].startswith(m):
        matches.append(ls[index])
        index += 1

    return matches

print find_matches(entries, "c")

Output:

>>> ['c', 'cc', 'cd', 'ce']

So to explain arainchi's very nice answer, make a dictionary with an entry for every line in your file. Then you can match your search string against the names of those entries. Dictionaries are really handy for this kind of searching.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM