Iterating over a String Word by Word in Python

Question

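I have a string buffer holding the contents of a huge text file, and I have to search it for a given set of words and phrases. What is an efficient way to do this?

I tried matching with the re module, but because the text corpus I have to search is huge, this takes a large amount of time.

I am given a dictionary of words and phrases.

I iterate through each file, read it into a string, search for every word and phrase in the dictionary, and increment that key's count in the dictionary whenever it is found.

One small optimization we thought of was to sort the dictionary of phrases/words from the largest number of words to the smallest, then, at each word start position in the string buffer, compare against the sorted list. If a phrase is found, we do not search for the other phrases at that position (it matched the longest phrase, which is what we want).

Can someone suggest how to go through the string buffer word by word (that is, iterate over the string buffer one word at a time)?

Also, are there any other optimizations that could be made here?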

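# file_content is the text of one file; dictionary_entity maps word/phrase -> count
data = str(file_content)
for j in dictionary_entity:
    # str.count() returns 0, never -1, when there is no match,
    # so the count can be added unconditionally
    dictionary_entity[j] += data.count(j + " ")
f.close()  # f is the file handle opened earlier (not shown)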

Answer 1

Here are three different ways to iterate word by word through the contents of a file (in my case, The Wizard of Oz from Project Gutenberg):

from __future__ import with_statement
import time
import re
from cStringIO import StringIO

def word_iter_std(filename):
    start = time.time()
    with open(filename) as f:
        for line in f:
            for word in line.split():
                yield word
    print 'iter_std took %0.6f seconds' % (time.time() - start)

def word_iter_re(filename):
    start = time.time()
    with open(filename) as f:
        txt = f.read()
    for word in re.finditer(r'\w+', txt):
        yield word  # a match object; word.group(0) is the matched text
    print 'iter_re took %0.6f seconds' % (time.time() - start)

def word_iter_stringio(filename):
    start = time.time()
    with open(filename) as f:
        io = StringIO(f.read())
    for line in io:
        for word in line.split():
            yield word
    print 'iter_io took %0.6f seconds' % (time.time() - start)

woo = '/tmp/woo.txt'

for word in word_iter_std(woo): pass
for word in word_iter_re(woo): pass
for word in word_iter_stringio(woo): pass

This results in:

% python /tmp/junk.py
iter_std took 0.016321 seconds
iter_re took 0.028345 seconds
iter_io took 0.016230 seconds

Answer 2

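This sounds like the kind of problem where a trie would really help. You should probably use some sort of compressed trie, such as a Patricia/radix trie. As long as the whole dictionary of words/phrases you are looking for fits in the trie, this will greatly reduce the time complexity. The way it works is that you take the beginning of a word and descend the trie until you find the longest match, then increment the counter in that node. This may mean you have to go back up the trie if a partial match doesn't pan out. Then you proceed to the beginning of the next word and do it again. The advantage of the trie is that every search through it searches the whole dictionary at once (each lookup should take about O(m), where m is the average length of a word/phrase in your dictionary).

If you can't fit the whole dictionary into one trie, you could split it across a few tries (for instance, one for all words/phrases starting with a-l and one for m-z) and sweep through the whole corpus once per trie.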
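As a rough illustration of the descend-and-remember-the-longest-match idea, here is a minimal sketch in Python 3 (unlike the Python 2 snippets elsewhere in this thread). It is an uncompressed, word-level trie built from nested dicts, not a Patricia/radix trie, and the names `build_trie`, `count_longest_matches`, and `END` are made up for this example. It also assumes that, after a match, scanning resumes after the matched phrase, as the question describes.

```python
# Python 3 sketch; build_trie, count_longest_matches and END are hypothetical names.
END = object()  # sentinel key: "a dictionary entry ends at this node"

def build_trie(phrases):
    """Build a word-level trie (nested dicts) from an iterable of phrases."""
    root = {}
    for phrase in phrases:
        node = root
        for word in phrase.split():
            node = node.setdefault(word, {})
        node[END] = phrase
    return root

def count_longest_matches(words, trie):
    """Scan the word list once, counting the longest entry starting at each position."""
    counts = {}
    i = 0
    while i < len(words):
        node = trie
        best, best_len = None, 0  # longest entry found starting at position i
        j = i
        while j < len(words) and words[j] in node:
            node = node[words[j]]
            j += 1
            if END in node:
                best, best_len = node[END], j - i
        if best is not None:
            counts[best] = counts.get(best, 0) + 1
            i += best_len  # assumption: resume after the matched phrase
        else:
            i += 1  # partial match did not pan out; try the next word
    return counts

text = "the wizard of oz met the wizard and the lion"
trie = build_trie(["the wizard", "the wizard of oz", "lion"])
print(count_longest_matches(text.split(), trie))
```

Keeping `best` while descending avoids having to walk back up the trie explicitly: if a longer candidate fails, the last complete entry seen is still on hand.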

Answer 3

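If the re module can't do it fast enough, you will be hard pressed to do it any faster. Either way, you need to read the entire file. You might consider fixing your regular expression (can you post it?). Some background on what you are trying to accomplish would help too.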

Answer 4

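You could try doing it the other way around: instead of processing the text corpus 2,000,000 times (once for each word), process it only once. For every word in the corpus, increment an entry in a hash table (or similar) that stores the count for that word. A simple example in pseudocode: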

word_counts = new hash<string,int>
for each word in corpus:
  if exists(word_counts[word]):
    word_counts[word]++
  else:
    word_counts[word] = 1

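You might be able to speed this up by initializing word_counts ahead of time with the full list of words, which would remove the need for the if statement... I'm not sure.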
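The single-pass idea can be sketched in Python 3 with `collections.Counter`, which makes the existence check unnecessary. This is only an illustration: the helper name `count_words`, the sample text, the `\w+` tokenization, and the lowercasing are assumptions, and it counts single words only, not multi-word phrases.

```python
import re
from collections import Counter

def count_words(text):
    """Count every word in one pass over the text (hypothetical helper)."""
    # Assumption: words are runs of \w characters, compared case-insensitively.
    return Counter(m.group(0).lower() for m in re.finditer(r"\w+", text))

corpus = "The lion and the tin man followed the road"
word_counts = count_words(corpus)

# Look up only the dictionary entries of interest; a Counter returns 0
# for words that never appeared, so no membership test is needed.
wanted = ["the", "lion", "witch"]
print({w: word_counts[w] for w in wanted})
```

After the single pass, each dictionary lookup is a constant-time hash access, so the cost no longer grows with the number of dictionary entries times the corpus size.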

Answer 5

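As xyld said, I don't think you can beat the speed of the re module, although it would help if you posted your regular expressions and possibly your code as well. The one thing I can add is: profile before you optimize. You may be quite surprised when you see where most of the processing time goes. I use hotshot to profile my code and am quite happy with it. You can find a good introduction to Python profiling at http://onlamp.com/pub/a/python/2005/12/15/profiling.html.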
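A minimal profiling sketch follows. Note that it swaps in the standard-library `cProfile` and `pstats` modules rather than hotshot, because hotshot is a Python 2 module that is not available in Python 3. The function `count_words` and the sample text are stand-ins invented for this example, not the asker's code.

```python
import cProfile
import io
import pstats
import re

def count_words(text):
    """Stand-in workload (hypothetical): count words found with a regex."""
    counts = {}
    for m in re.finditer(r"\w+", text):
        word = m.group(0)
        counts[word] = counts.get(word, 0) + 1
    return counts

text = "Hello, this is a sentence. " * 10000

# Profile only the call we care about.
profiler = cProfile.Profile()
profiler.enable()
result = count_words(text)
profiler.disable()

# Collect the five entries with the largest cumulative time into a string.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

Sorting by cumulative time shows which top-level calls dominate; sorting by `"tottime"` instead highlights the individual functions where the time is actually spent.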

Answer 6

If using re is not fast enough, you are probably using findall(), or finding the matches one by one manually. Using an iterator might make it faster:

>>> for i in re.finditer(r'\w+', 'Hello, this is a sentence.'):
...     print i.group(0)
...     
Hello
this
is
a
sentence

Answer 7

#!/usr/bin/env python
import re

s = ''
for i in xrange(0, 100000):
    s = s + 'Hello, this is a sentence. '
    if i == 50000:
        s = s + " my phrase "

s = s + 'AARRGH'

print len(s)

itr = re.compile(r'(my phrase)|(\w+)').finditer(s)
for w in itr:
    if w.group(0) == 'AARRGH':
        print 'Found AARRGH'
    elif w.group(0) == "my phrase":
        print 'Found "my phrase"'

Running this, we get:

$ time python itrword.py
2700017
Found "my phrase"
Found AARRGH

real    0m0.616s
user    0m0.573s
sys     0m0.033s

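However, each "phrase" explicitly added to the regular expression takes its toll on performance: by my rough measurement, the above is 50% slower than just using "\w+".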
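To connect this with the longest-phrase-first idea from the question, here is a hedged Python 3 sketch that builds one alternation from a list of entries and counts matches in a single `finditer` pass. The helpers `build_pattern` and `count_entries` are hypothetical names, and the `\b` anchors assume every entry starts and ends with a word character. As noted above, each extra alternative costs time, so this is likely only practical for a modest number of entries.

```python
import re

def build_pattern(entries):
    """Compile one alternation that tries longer entries before shorter ones."""
    # Regex alternation takes the first alternative that matches, so an entry
    # must come before any of its own prefixes ("my phrase" before "my").
    ordered = sorted(entries, key=len, reverse=True)
    alternation = "|".join(re.escape(e) for e in ordered)
    return re.compile(r"\b(?:%s)\b" % alternation)

def count_entries(text, entries):
    """Count non-overlapping, longest-first matches of each entry in text."""
    counts = dict.fromkeys(entries, 0)
    for m in build_pattern(entries).finditer(text):
        counts[m.group(0)] += 1
    return counts

text = "my phrase is not my phrase book, it is my book"
print(count_entries(text, ["my", "my phrase", "book"]))
```

Unlike `str.count(entry + " ")`, the word-boundary anchors keep "my" from matching inside a longer word, and a position consumed by "my phrase" is not counted again for "my".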

Answer 8

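Have you considered looking at the Natural Language Toolkit? It includes many nice functions for working with a text corpus, and it also has a cool FreqDist class that behaves both dict-like (it has keys) and list-like (it supports slicing).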
