Iterating over a string one word at a time in Python

Question

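I have a string buffer holding the contents of a huge text file, and I have to search that buffer for given words and phrases. What is an efficient way to do it?

I tried matching with the re module, but because the text corpus I have to search is huge, this takes a large amount of time.

I am given a dictionary of words and phrases.

I iterate over each file, read it into a string, search for every word and phrase in the dictionary, and increment the count in the dictionary whenever a key is found.

One small optimization we thought of was to sort the dictionary of phrases/words from the largest number of words to the smallest, then, at each word start position in the string buffer, compare against that list. If a phrase is found, we don't search for the other phrases (because it matched the longest phrase, which is what we want).

Can someone suggest how to go through the string buffer word by word (that is, iterate over the string buffer one word at a time)?

Also, is there any other optimization that can be done here?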

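data = str(file_content)
for j in dictionary_entity.keys():
    # str.count() returns 0 (never -1) when the key is absent
    cnt = data.count(j + " ")
    if cnt:
        dictionary_entity[j] = dictionary_entity[j] + cnt
f.close()  # f is the file object opened earlier (not shown)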

Answer 1

Here are three different ways of iterating word by word through the contents of a file (The Wizard of Oz from Project Gutenberg, in my case):

import re
import time
from io import StringIO

def word_iter_std(filename):
    # Iterate over the file line by line and split each line on whitespace.
    start = time.time()
    with open(filename) as f:
        for line in f:
            for word in line.split():
                yield word
    print('iter_std took %0.6f seconds' % (time.time() - start))

def word_iter_re(filename):
    # Read the whole file, then let a regular expression find the words.
    start = time.time()
    with open(filename) as f:
        txt = f.read()
    for match in re.finditer(r'\w+', txt):
        yield match.group(0)  # yield the matched text, not the match object
    print('iter_re took %0.6f seconds' % (time.time() - start))

def word_iter_stringio(filename):
    # Read the whole file into an in-memory buffer, then iterate its lines.
    start = time.time()
    with open(filename) as f:
        buf = StringIO(f.read())
    for line in buf:
        for word in line.split():
            yield word
    print('iter_io took %0.6f seconds' % (time.time() - start))

woo = '/tmp/woo.txt'

for word in word_iter_std(woo): pass
for word in word_iter_re(woo): pass
for word in word_iter_stringio(woo): pass

This results in:

% python /tmp/junk.py
iter_std took 0.016321 seconds
iter_re took 0.028345 seconds
iter_io took 0.016230 seconds

Answer 2

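This sounds like the sort of problem where a trie would really help. You should probably use some sort of compressed trie, such as a Patricia/radix trie. As long as you can fit the whole dictionary of words/phrases you are looking for into the trie, this will greatly reduce the time complexity. The way it works is that you take the beginning of a word and descend the trie until you find the longest match, then increment the counter in that node. This might mean you have to ascend the trie again if a partial match doesn't pan out. Then you proceed to the beginning of the next word and do it again. The advantage of the trie is that every search through it is a search of the whole dictionary (each lookup should take about O(m), where m is the average length of a word/phrase in your dictionary).

If you can't fit the whole dictionary into one trie, you could split the dictionary into a few tries (for instance, one for all words/phrases starting with a-l and one for m-z) and sweep through the whole corpus once for each trie.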
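The descent described above can be sketched in a few lines. This is only a minimal illustration, not a Patricia/radix trie: it uses a plain, uncompressed trie of nested dicts keyed on whole words rather than characters. The names `build_trie`, `count_longest_matches` and `END` are made up for this example, and it assumes that after a match the scan resumes after the matched phrase.

```python
END = None  # sentinel key marking "a phrase ends here"; never equal to a word

def build_trie(phrases):
    """Build a word-level trie (nested dicts) from an iterable of phrases."""
    root = {}
    for phrase in phrases:
        node = root
        for word in phrase.split():
            node = node.setdefault(word, {})
        node[END] = phrase  # remember which phrase ends at this node
    return root

def count_longest_matches(words, trie):
    """Count, at each position, the longest phrase that starts there."""
    counts = {}
    i = 0
    while i < len(words):
        node = trie
        best, best_end = None, i
        j = i
        # Descend as long as the next word continues some phrase.
        while j < len(words) and words[j] in node:
            node = node[words[j]]
            j += 1
            if END in node:
                best, best_end = node[END], j  # longest complete match so far
        if best is not None:
            counts[best] = counts.get(best, 0) + 1
            i = best_end  # assumption: resume after the matched phrase
        else:
            i += 1
    return counts
```

Remembering the last complete phrase seen during the descent plays the role of "ascending the trie" when a longer partial match doesn't pan out. If overlapping matches should be counted as well, replacing `i = best_end` with `i += 1` would be one way to do it.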

Answer 3

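If the re module can't do it fast, you're going to be hard pressed to do it any faster. Either way, you need to read the entire file. You might consider fixing your regular expression (can you provide one?). Some background on what you're trying to accomplish would help too.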

Answer 4

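You could try doing it the other way around: instead of processing the text corpus 2,000,000 times (once for each word), process it only once. For every single word in the corpus, increment an entry in a hash table or similar structure that stores the count of that word. A simple example:

word_counts = {}
for word in corpus:  # corpus: any iterable of words
    if word in word_counts:
        word_counts[word] += 1
    else:
        word_counts[word] = 1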

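You might be able to speed it up by initializing word_counts ahead of time with the complete list of words, which would then make that if statement unnecessary... not sure.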
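As a rough sketch of the single-pass idea in Python: rather than pre-filling the table, `collections.Counter` can be used, since it treats missing keys as zero and so needs no existence check. The function name `count_dictionary_words` is hypothetical, words are assumed to be runs of `\w` characters, and only single-word entries are handled here; multi-word phrases would need separate treatment.

```python
import re
from collections import Counter

def count_dictionary_words(text, dictionary_words):
    """Count every word of the corpus once, then look up the wanted words."""
    # One pass over the corpus: tally every word it contains.
    all_counts = Counter(re.findall(r'\w+', text))
    # Counter returns 0 for words that never occurred.
    return {word: all_counts[word] for word in dictionary_words}
```

The cost of the pass over the corpus no longer depends on how many dictionary entries there are; each entry costs only one hash lookup afterwards.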

Answer 5

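As xyld said, I don't think you can beat the speed of the re module, although it would help if you posted your regular expressions and possibly the code as well. All I can add is: try profiling before optimizing. You may be quite surprised when you see where most of the processing time goes. I use hotshot to profile my code and am quite happy with it. You can find a good introduction to Python profiling at http://onlamp.com/pub/a/python/2005/12/15/profiling.html.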
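For illustration, here is a minimal profiling sketch. It uses the standard library's `cProfile` and `pstats` rather than hotshot, because hotshot exists only in Python 2. The workload `count_words` and the helper `profile_report` are hypothetical stand-ins for whatever code is actually being measured.

```python
import cProfile
import io
import pstats
import re

def count_words(text):
    """Hypothetical workload: count the words a regular expression finds."""
    counts = {}
    for match in re.finditer(r'\w+', text):
        word = match.group(0)
        counts[word] = counts.get(word, 0) + 1
    return counts

def profile_report(func, *args):
    """Run func under cProfile; return its result and a text report."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    # Sort by cumulative time and show the ten most expensive entries.
    stats.sort_stats('cumulative').print_stats(10)
    return result, out.getvalue()
```

Calling `profile_report(count_words, text)` and printing the second element of the returned tuple shows how the time is split between the regular expression and the dictionary updates.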

Answer 6

If using re is not performant enough, you are probably using findall(), or finding the matches one by one manually. Using an iterator might make it faster:

>>> for i in re.finditer(r'\w+', 'Hello, this is a sentence.'):
...     print(i.group(0))
...
Hello
this
is
a
sentence

Answer 7

#!/usr/bin/env python
import re

# Build a large test string with one phrase hidden in the middle.
s = ''
for i in range(0, 100000):
    s = s + 'Hello, this is a sentence. '
    if i == 50000:
        s = s + " my phrase "

s = s + 'AARRGH'

print(len(s))

# The phrase is listed before \w+ so that it is tried first at each position.
itr = re.compile(r'(my phrase)|(\w+)').finditer(s)
for w in itr:
    if w.group(0) == 'AARRGH':
        print('Found AARRGH')
    elif w.group(0) == "my phrase":
        print('Found "my phrase"')

Running this, we get

$ time python itrword.py
2700017
Found "my phrase"
Found AARRGH

real    0m0.616s
user    0m0.573s
sys     0m0.033s

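However, each "phrase" explicitly added to the regular expression takes its toll on performance: by my rough measurement, the above is about 50% slower than using just "\w+".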
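To extend this from one hard-coded phrase to a whole dictionary, the alternation can be generated from the entries. The sketch below is one possible way to do that; the function name `count_phrases` is made up, and it assumes a non-empty dictionary whose entries begin and end with word characters, so that the `\b` anchors behave as intended. Per the measurement above, expect it to slow down as entries are added.

```python
import re
from collections import Counter

def count_phrases(text, phrases):
    """Count occurrences of dictionary entries with one compiled pattern."""
    # Longest entries first, so the alternation prefers the longest phrase
    # when several entries start at the same position.
    ordered = sorted(phrases, key=len, reverse=True)
    # re.escape() keeps punctuation inside an entry from acting as regex syntax.
    alternation = '|'.join(re.escape(p) for p in ordered)
    pattern = re.compile(r'\b(?:' + alternation + r')\b')
    return Counter(m.group(0) for m in pattern.finditer(text))
```

Matches do not overlap: once "my phrase" has matched, the "my" inside it is not counted again.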

Answer 8

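Have you considered looking at the Natural Language Toolkit? It includes many nice functions for working with a text corpus, and it also has a cool FreqDist class that behaves both like a dict (it has keys) and like a list (it can be sliced).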
