Fast way to search for strings in a large text file using Python

Question

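This is my current situation:

I have a 2.5 MB text file containing about 250k strings, sorted alphabetically
Each string is unique
I don't need to modify the entries in the text file: once the text file is loaded, it is never edited
The text file is loaded at startup, and after that I only need to search for strings in it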

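The last point is the problem. I actually need to search for both complete and partial matches of a string. The algorithm I wrote uses regular expressions, combined with a few attempts to speed things up: for example, I hardcoded into my script the indexes at which each letter of the alphabet starts in the dictionary, and then split the big text file into 26 smaller dictionaries. That was totally useless; the script is still very slow. After skimming some posts here, I was persuaded to try mmap, but it looks useless for finding all the partial matches given a regular expression. Eventually I came to the conclusion that a trie might solve my problem, though I hardly know what one is. Should I go with tries? If so, how should I go about creating a trie in Python? Is the marisa-trie module good? Thanks, everybody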

Edit: By "partial match", I mean that I have the prefix of a string. I don't need matches at the end or in the middle, only at the beginning.

Answer 1

The simplest and fastest solution:

#!/usr/bin/env python3

d = {}

# open your file here, I'm using /etc/hosts as an example...
with open("/etc/hosts", "r") as f:
    for line in f:
        line = line.rstrip()
        # store every prefix of the line, including the full line itself
        for i in range(1, len(line) + 1):
            d[line[:i]] = True


while True:
    w = input('> ')
    if not w:
        break

    if w in d:
        print("match found", w)

Here is a slightly more complex, but more memory-efficient, one:

#!/usr/bin/env python3

d = []

def binary_search(a, x, lo=0, hi=None):
    # return the index of x in the sorted list a, or -1 if it is absent
    if hi is None:
        hi = len(a)
    while lo < hi:
        mid = (lo + hi) // 2
        midval = a[mid]
        if midval < x:
            lo = mid + 1
        elif midval > x:
            hi = mid
        else:
            return mid
    return -1


with open("/etc/hosts", "r") as f:
    for line in f:
        line = line.rstrip()
        # store only the hash of each prefix to save memory;
        # note that a hash collision could report a false match
        for i in range(1, len(line) + 1):
            d.append(hash(line[:i]))

d.sort()

while True:
    w = input('> ')
    if not w:
        break

    if binary_search(d, hash(w)) != -1:
        print("match found", w)

Answer 2

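Since the file is already sorted and can be read into memory, you can run a binary search on it without resorting to any fancy data structures. Python has a binary search function built in: bisect.bisect_left.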
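A minimal sketch of how that could look for prefix matches, assuming the lines are already in a list sorted in Python's own string order. The helper name prefix_matches and the sample words are illustrative only, not from the original post. Two bisect_left calls bracket the run of entries that share the prefix:

```python
import bisect

def prefix_matches(sorted_words, prefix):
    """Return every entry of sorted_words that starts with prefix."""
    lo = bisect.bisect_left(sorted_words, prefix)
    # Assumption: no entry contains U+10FFFF, so prefix + that character
    # sorts after every string that begins with prefix.
    hi = bisect.bisect_left(sorted_words, prefix + "\U0010ffff")
    return sorted_words[lo:hi]

words = sorted(["hai", "hello", "test", "testing", "world"])
print(prefix_matches(words, "te"))
print(prefix_matches(words, "zz"))
```

Both lookups are binary searches, so only the final slice grows with the number of matches.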

Answer 3

Use a trie.

#dictionary is a list of words
def parse_dictionary(dictionary):
    dictionary_trie = {}
    for word in dictionary:
        tmp_trie = dictionary_trie
        for letter in word:
            if letter not in tmp_trie:
                tmp_trie[letter] = {}
            if 'words' not in tmp_trie[letter]:
                tmp_trie[letter]['words'] = []

            tmp_trie[letter]['words'].append(word)
            tmp_trie = tmp_trie[letter]
    return dictionary_trie

def matches(substring, trie):
    d = trie
    for letter in substring:
        try:
            d = d[letter]
        except KeyError:
            return []
    return d['words']

Usage example:

>>> import pprint
>>> dictionary = ['test', 'testing', 'hello', 'world', 'hai']
>>> trie = parse_dictionary(dictionary)
>>> pprint.pprint(trie)
{'h': {'a': {'i': {'words': ['hai']}, 'words': ['hai']},
       'e': {'l': {'l': {'o': {'words': ['hello']}, 'words': ['hello']},
                   'words': ['hello']},
             'words': ['hello']},
       'words': ['hello', 'hai']},
 't': {'e': {'s': {'t': {'i': {'n': {'g': {'words': ['testing']},
                                     'words': ['testing']},
                               'words': ['testing']},
                         'words': ['test', 'testing']},
                   'words': ['test', 'testing']},
             'words': ['test', 'testing']},
       'words': ['test', 'testing']},
 'w': {'o': {'r': {'l': {'d': {'words': ['world']}, 'words': ['world']},
                   'words': ['world']},
             'words': ['world']},
       'words': ['world']}}
>>> matches('h', trie)
['hello', 'hai']
>>> matches('he', trie)
['hello']
>>> matches('asd', trie)
[]
>>> matches('test', trie)
['test', 'testing']
>>>

Answer 4

You can create a list, make each line an element of the list, and then do a binary search.
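As a rough sketch of that suggestion for exact matches: the helper names load_lines and contains are made up for illustration, and io.StringIO stands in for the real file. It assumes the file's sort order agrees with Python's string comparison.

```python
import bisect
import io

def load_lines(handle):
    """Read one entry per line, dropping the trailing newline."""
    return [line.rstrip("\n") for line in handle]

def contains(sorted_lines, target):
    """Binary search for an exact match in a sorted list."""
    i = bisect.bisect_left(sorted_lines, target)
    return i < len(sorted_lines) and sorted_lines[i] == target

# io.StringIO stands in for open("words.txt"), a hypothetical file name.
sample = io.StringIO("apple\nbanana\ncherry\n")
lines = load_lines(sample)
print(contains(lines, "banana"))
print(contains(lines, "ban"))
```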

Answer 5

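Using a trie still requires you to build the trie, which is O(n) because you iterate over the whole file. Taking advantage of the fact that the file is sorted brings a lookup down to O(log_2 n). So this faster solution uses binary search (see below).

This solution still requires you to read the whole file. In an even faster solution, you could preprocess the file and pad all the lines so that they have the same length (or build some kind of index structure in the file that makes seeking to the middle of the list feasible); seeking to the middle of the file would then take you to the middle of the list. The "even faster" solution would probably only be needed for a really huge file (gigabytes, or hundreds of megabytes). You would combine it with binary search.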

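Possibly, if the filesystem supports sparse files, the padding scheme above would not increase the number of file blocks actually used on disk.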
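The padding idea could be sketched roughly as follows. Everything here is an assumption for illustration: the 16-byte record size, ASCII-only entries without trailing whitespace, and the helper names pad_records, read_record and file_contains. io.BytesIO stands in for a real file opened in binary mode.

```python
import io

RECORD_SIZE = 16  # assumed fixed width in bytes, including the newline

def pad_records(words):
    """Pad each word with spaces so every record is RECORD_SIZE bytes."""
    out = io.BytesIO()
    for word in sorted(words):
        data = word.encode("ascii")
        if len(data) >= RECORD_SIZE:
            raise ValueError("word too long for the record size")
        out.write(data.ljust(RECORD_SIZE - 1) + b"\n")
    return out

def read_record(handle, index):
    """Seek straight to record number index and return it without padding."""
    handle.seek(index * RECORD_SIZE)
    return handle.read(RECORD_SIZE).rstrip()

def file_contains(handle, target):
    """Binary search over the records, reading only O(log n) of them."""
    key = target.encode("ascii")
    handle.seek(0, io.SEEK_END)
    lo, hi = 0, handle.tell() // RECORD_SIZE
    while lo < hi:
        mid = (lo + hi) // 2
        record = read_record(handle, mid)
        if record < key:
            lo = mid + 1
        elif record > key:
            hi = mid
        else:
            return True
    return False

padded = pad_records(["world", "hai", "test", "hello", "testing"])
print(file_contains(padded, "hello"))
print(file_contains(padded, "help"))
```

Because every record has the same width, record i starts at byte i * RECORD_SIZE, so the search never has to scan for line boundaries.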

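Then, at that point, you are probably getting into a b-tree or b+tree implementation to make the indexing efficient. So you could use a b-tree library.

Something like this: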

import bisect

entries = ["a", "b", "c", "cc", "cd", "ce", "d", "e", "f"]

def find_matches(ls, m):
    # bisect_left finds the index of the first entry that is >= m
    index = bisect.bisect_left(ls, m)
    matches = []

    # collect consecutive entries that start with m,
    # stopping at the end of the list
    while index < len(ls) and ls[index].startswith(m):
        matches.append(ls[index])
        index += 1

    return matches

print(find_matches(entries, "c"))

Output:

>>> ['c', 'cc', 'cd', 'ce']

Answer 6

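So, to explain arainchi's very nice answer: make a dictionary with one entry for each line in your file. Then you can match the search string against the names of those entries. Dictionaries are really handy for this kind of search.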
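A small sketch of that description, covering exact matches only. The helper name build_lookup and the choice to store line numbers as values are assumptions for illustration; io.StringIO stands in for the real file.

```python
import io

def build_lookup(handle):
    """Map each line (without its newline) to its 1-based line number."""
    return {line.rstrip("\n"): number
            for number, line in enumerate(handle, start=1)}

# io.StringIO stands in for the real file object.
sample = io.StringIO("hai\nhello\ntest\n")
lookup = build_lookup(sample)
print("hello" in lookup)
print(lookup.get("test"))
print("hel" in lookup)
```

A dictionary keyed on whole lines does not answer prefix queries by itself; for those, each prefix would also have to be stored as a key.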


6 solutions

Solution 1
5 (accepted) 2013-02-22 23:15:02

Solution 2
2 2013-02-22 23:15:21

Solution 3
1 2013-02-22 23:12:49

Solution 4
0 2013-02-22 23:14:43

Solution 5
0 2013-02-22 23:33:03

Solution 6
0 2013-02-22 23:37:09
