Alternative way to extract lines from a text (python-regex)

Question

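I am looking for a way to extract lines from a fairly large database in Python. I only need to keep the lines that contain one of my keywords. I thought a regular expression would solve the problem, so I put together the code below. Unfortunately it gives me an error (perhaps because my keyword list is really large, almost 500 entries). The keywords are written on separate lines in the file listtosearch.txt.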

import re
data = open('database.txt').read() 
fileout = open("fileout.txt","w+")

with open('listtosearch.txt', 'r') as f:
    keywords = [line.strip() for line in f]

pattern = re.compile('|'.join(keywords))

for line in data:
    if pattern.search(line):
        fileout.write(line)

I also tried a double loop (over the keyword list and the database lines), but it takes too long to run.

The error I get is:

Traceback (most recent call last):
  File "/usr/lib/python2.7/re.py", line 190, in compile 
    return _compile(pattern, flags)   
  File "/usr/lib/python2.7/re.py", line 240, in _compile 
    p = sre_compile.compile(pattern, flags) 
  File "/usr/lib/python2.7/sre_compile.py", line 511, in compile 
    "sorry, but this version only supports 100 named groups" 
AssertionError: sorry, but this version only supports 100 named groups

Any suggestions? Thanks

Answer 1

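You may want to take a look at the Aho–Corasick string matching algorithm. A working Python implementation can be found here.

A simple usage example for that module: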

from pyahocorasick import Trie

words = ['foo', 'bar']

t = Trie()
for w in words:
    t.add_word(w, w)
t.make_automaton()

print [a for a in t.iter('my foo is a bar')]

>> [(5, ['foo']), (14, ['bar'])]

Integrating this into your code should be straightforward.
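To make the idea concrete without depending on any particular module, here is a minimal pure-Python sketch of an Aho–Corasick automaton used as a line filter. It is written for Python 3, and the names build_automaton and contains_any, as well as the sample lines, are made up for this illustration; they are not the API of the module used above.

```python
from collections import deque


def build_automaton(words):
    """Build trie transitions, failure links, and output lists for `words`."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> state for the longest proper suffix
    out = [[]]    # out[state] -> keywords that end at this state

    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)

    # Breadth-first pass: set failure links and merge inherited outputs.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def contains_any(line, automaton):
    """Return True as soon as any keyword occurs in `line`."""
    goto, fail, out = automaton
    state = 0
    for ch in line:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        if out[state]:
            return True
    return False


# Hypothetical sample data standing in for the keyword file and the database.
automaton = build_automaton(["foo", "bar"])
lines = ["my foo is a bar\n", "nothing here\n", "rebar\n"]
kept = [line for line in lines if contains_any(line, automaton)]
print(kept)
```

Each line is scanned once, character by character, no matter how many keywords there are, which is the point of the algorithm. Note that this matches substrings, so "rebar" is kept because it contains "bar".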

Answer 2

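First of all, I am pretty sure you meant data = open('database.txt').readlines() rather than read(). Otherwise data is a string rather than a list of lines, and the loop for line in data would walk over single characters, which makes no sense here.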
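A quick way to see the difference, sketched in Python 3 with io.StringIO standing in for open('database.txt') (the two sample lines are made up):

```python
import io

# Stand-in for the contents of database.txt; the text is invented sample data.
sample = "first line\nsecond line\n"

# Iterating over the result of read() walks the string one character at a time.
chars = [item for item in io.StringIO(sample).read()]

# Iterating over the file object (or over readlines()) yields whole lines.
lines = [item for item in io.StringIO(sample)]

print(chars[:5])
print(lines)
```

Iterating over the file object directly also avoids loading the whole file into memory at once.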

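At this point you are really looking for a solution that indexes by keyword; a naive search will no longer be efficient enough to give you timely results.

There really is no other way to significantly improve efficiency or reduce complexity. You will have to grit your teeth and accept the cost of going through the whole database.

Also, if the database fits entirely in memory, it cannot be that large :)

That said, there are a few other approaches that may be more efficient:

Put your keywords in a set, then tokenize the input data into words and look each of them up in the set: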

data = open('database.txt').readlines()
fileout = open("fileout.txt","w+")

with open('listtosearch.txt', 'r') as f:
    keywords = [line.strip() for line in f]

keywords = set(keywords)

for line in data:
    # You might have to be smarter about splitting the line to
    # take things like punctuation into consideration.
    for word in line.split():
        if word in keywords:
            fileout.write(line)
            break

Here is an example of tokenization that takes punctuation into account.
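As a minimal sketch of what punctuation-aware splitting could look like (Python 3, with invented sample keywords and lines in place of the two files), re.findall with \w+ can replace line.split():

```python
import re

# Hypothetical sample data standing in for listtosearch.txt and database.txt.
keywords = {"george", "foo"}
lines = ['"Name george" (2000)\n', '"Name fake" (3000)\n', 'foo, bar.\n']

# \w+ pulls out runs of letters, digits, and underscores, so surrounding
# quotes, commas, and parentheses never end up glued to a word.
word_re = re.compile(r"\w+")

kept = [line for line in lines
        if any(word in keywords for word in word_re.findall(line))]
print(kept)
```

With a plain line.split() the first line would yield the tokens '"Name' and 'george"', neither of which is in the set, so the line would be missed.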

Answer 3

Here is my code:

import re
data = open('database.txt', 'r')
fileout = open("fileout.txt","w+")

with open('listtosearch.txt', 'r') as f:
    keywords = [line.strip() for line in f]

# one big pattern can take time to match, so you have a list of them
patterns = [re.compile(keyword) for keyword in keywords]

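for line in data:
    for pattern in patterns:
        # stop at the first pattern that does not match this line
        if not pattern.search(line):
            break
    else:
        # for/else: reached only when no break happened,
        # i.e. every pattern matched the line
        fileout.write(line)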

I tested it with the following files:

database.txt

"Name jhon" (1995)
"Name foo" (2000)
"Name fake" (3000)
"Name george" (2000)
"Name george" (2500)

listtosearch.txt

"Name (george)"
\(2000\)

And this is what I got in fileout.txt:

"Name george" (2000)

So this should work on your machine as well.
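Note that the loop above keeps a line only when every pattern matches it. If the goal is to keep lines that contain any one of the keywords, a variant along these lines may be closer. This is a sketch in Python 3 under the assumption that the keywords are literal strings rather than regular expressions, with invented sample data. Escaping matters because literal parentheses in a keyword would otherwise be read as capturing groups, which is a plausible source of the "100 named groups" error in the question.

```python
import re

# Hypothetical sample data; real code would read the two files instead.
keywords = ["george", "(2000)", "a.b"]
lines = ['"Name jhon" (1995)\n', '"Name foo" (2000)\n',
         '"Name george" (2500)\n', 'aXb\n']

# re.escape turns each keyword into a literal, so "(" and "." lose their
# special meaning and no capturing groups are created. Empty keywords are
# skipped because an empty alternative would match every line.
pattern = re.compile("|".join(re.escape(k) for k in keywords if k))

kept = [line for line in lines if pattern.search(line)]
print(kept)
```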

Answer 4

It may not be an efficient solution, but try using a set and its intersection operation.

from_db = tuple([line.rstrip("\n") for line in open('database.txt') if line.rstrip('\n')])
keywords = set([line.rstrip("\n") for line in open('listtosearch.txt') if line.rstrip('\n')])
with open("output_file.txt", "w") as fp:
    for line in from_db:
        line_set = set(line.split(" "))
        if line_set.intersection(keywords):
            fp.write(line + "\n")

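The intersection checks for any strings in common. Since hash values are compared, I think the search will be faster than traversing the whole list over and over.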
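A small variation on the same idea, sketched in Python 3 with invented sample data: set.isdisjoint answers the "anything in common?" question directly and does not need to build the intersection set.

```python
# Hypothetical sample data standing in for the keyword file and the database.
keywords = {"george", "foo"}
lines = ["Name george 2000", "Name fake 3000"]

# isdisjoint accepts any iterable, so the split tokens can be passed as-is.
kept = [line for line in lines if not keywords.isdisjoint(line.split())]
print(kept)
```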


Answer 1
2 2013-06-27 12:26:46

Answer 2
1 2013-06-27 10:44:35

Answer 3
1 accepted 2013-06-27 11:29:25

Answer 4
1 2013-06-27 13:14:43
