Parsing unique words from a text file

Question

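I'm working on a project to parse out unique words from a large number of text files. I've got the file handling down, but I'm trying to refine the parsing procedure. Each file has a specific text segment that ends with certain phrases, which I catch with a regex on my live system.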

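The parser should walk through each line and check each word against 3 criteria:

Longer than two characters
Not in a predefined dictionary set, dict_file
Not already in the word list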

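The result should be a 2D array, each row holding the list of unique words for one file, which is written to a CSV file with the .writerow(foo) method after each file is processed.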

My working code is below, but it's slow and kludgy. What am I missing?

My production system runs Python 2.5.1 with only the default modules (so NLTK is a no-go), and it can't be upgraded to 2.7+.

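import string

# punct is assumed to be an identity translation table, e.g. string.maketrans('', '')
def process(line):
    line_strip = line.strip()
    return line_strip.translate(punct, string.punctuation)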

# Directory walking and initialization here
report_set = set()
with open(fullpath, 'r') as report:
    for line in report:
        # Strip out the CR/LF and punctuation from the input line
        line_check = process(line)
        if line_check == "FOOTNOTES":
            break
        for word in line_check.split():
            word_check = word.lower()
            if ((word_check not in report_set) and (word_check not in dict_file)
                 and (len(word) > 2)):
                # Sets have no .append(); .add() is the set method
                report_set.add(word_check)
report_list = list(report_set)

Edit: Updated my code based on steveha's suggestions.

Answer 1

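One problem is that an in test on a list is slow, because it has to scan the list element by element. You should keep a set to track the words you have already seen, because the in test on a set is very fast.

Example: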

report_set = set()
for line in report:
    for word in line.split():
        if we_want_to_keep_word(word):
            report_set.add(word)

Then, when you are done: report_list = list(report_set)

Any time you need to force a set into a list, you can. But if you only need to loop over it or do in tests, you can leave it as a set; it's legal to write for x in report_set:
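To make the example above concrete, here is a minimal sketch. The helper we_want_to_keep_word is hypothetical; its body is an assumption based on the criteria in the question (longer than two characters, not in dict_file), and the sample dict_file and input lines are made up for illustration.

```python
# Hypothetical stand-in for the question's predefined dictionary set.
dict_file = set(["the", "and", "was"])

def we_want_to_keep_word(word):
    # Assumed criteria from the question: longer than two characters
    # and not in the predefined dictionary.
    return len(word) > 2 and word not in dict_file

# Made-up input lines standing in for an open report file.
report = ["the cat and the hat", "a cat was here"]

report_set = set()
for line in report:
    for word in line.split():
        if we_want_to_keep_word(word):
            report_set.add(word)

# A set can be looped over directly; convert it only when list
# behaviour (ordering, indexing) is actually needed.
report_list = sorted(report_set)
```

The "already seen" criterion needs no code of its own here, because the set silently ignores duplicates.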

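Another problem, which may or may not matter, is that you are slurping all the lines from the file in one go with the .readlines() method. For really large files it is better to use the open file-handle object as an iterator, like so: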

with open("filename", "r") as f:
    for line in f:
        ... # process each line here
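As a rough illustration of the iterator form, the sketch below writes a small throwaway file and then walks it one line at a time, so the whole file never has to sit in memory. The file contents and the counter names are made up for the example. On Python 2.5 the with statement additionally needs from __future__ import with_statement at the top of the module.

```python
import os
import tempfile

# Throwaway sample file so the sketch is self-contained.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as f:
    f.write("alpha beta\ngamma\ndelta epsilon zeta\n")

line_count = 0
word_count = 0
with open(path, "r") as f:
    for line in f:  # the file object hands out one line at a time
        line_count += 1
        word_count += len(line.split())

os.remove(path)
```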

A big problem is that I don't even see how this code can work:

while 1:
    lines = report.readlines()
    if not lines:
        break

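This looks like an infinite loop, but it actually runs exactly twice. The first call to .readlines() slurps all the input lines. Then we loop again, and by the next call report is already exhausted, so .readlines() returns an empty list, which breaks out of the loop. But by then the lines we just read have been thrown away, and the rest of the code has to make do with an empty lines variable. How does this even work?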
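A quick way to check what that loop does is to run it against an in-memory file. This is only a sketch: io.StringIO stands in for the real report file (the io module needs Python 2.6 or later), and the passes counter is added purely for illustration.

```python
import io

# In-memory stand-in for the open report file.
report = io.StringIO(u"line one\nline two\n")

passes = 0
while 1:
    lines = report.readlines()  # second call finds the file exhausted
    passes += 1
    if not lines:
        break

# After the loop, the lines read on the first pass are gone:
# lines now refers to the empty list from the second call.
```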

So, get rid of that whole while 1 loop, and change the next loop to for line in report:

Also, you don't really need to keep a count variable. You can call len(report_set) at any time to find out how many words are in the set.

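Also, with a set you don't actually need to check whether a word is already in the set; you can always just call report_set.add(word), and if the word is already there it won't be added again!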
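A small sketch of both points, using a made-up word list: .add() is called unconditionally, and len() gives the number of distinct words without a separate counter.

```python
report_set = set()
for word in ["parse", "words", "parse", "unique", "words"]:
    report_set.add(word)  # no membership check needed; duplicates are ignored

count = len(report_set)  # number of distinct words seen so far
```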

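Also, you don't have to do it my way, but I like to write a generator that does all the processing: strip the line, translate the line, split on whitespace, and yield words that are ready to use. I would also force the words to lower case inside the generator, except that I don't know whether it matters that FOOTNOTES is detected only in upper case.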

So, put all of the above together and you get:

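import string

def words(file_object):
    for line in file_object:
        # Passing None as the table needs Python 2.6+; on 2.5 pass string.maketrans('', '') instead
        line = line.strip().translate(None, string.punctuation)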
        for word in line.split():
            yield word

report_set = set()
with open(fullpath, 'r') as report:
    for word in words(report):
        if word == "FOOTNOTES":
            break
        word = word.lower()
        if len(word) > 2 and word not in dict_file:
            report_set.add(word)

print("Words in report_set: %d" % len(report_set))
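The question also asks for one CSV row per file. The sketch below shows one way that might look, under several assumptions: the names reports and unique_words are hypothetical, the dictionary and file contents are made up, io.StringIO objects stand in for real files and for the CSV output (the io module needs Python 2.6 or later), and punctuation is removed with a join instead of translate because the translate signature differs between Python versions. Each row is sorted only to make the output order predictable.

```python
import csv
import io
import string

# Hypothetical stand-ins for the predefined dictionary and two report files.
dict_file = set(["the", "and", "with"])
reports = {
    "a.txt": u"The quick, quick fox.\nFOOTNOTES\nignored text\n",
    "b.txt": u"Dogs and cats; dogs with cats!\n",
}

def words(file_object):
    """Yield punctuation-free words, one at a time."""
    for line in file_object:
        line = "".join(ch for ch in line.strip() if ch not in string.punctuation)
        for word in line.split():
            yield word

def unique_words(file_object):
    """Return the sorted unique words that appear before FOOTNOTES."""
    found = set()
    for word in words(file_object):
        if word == "FOOTNOTES":
            break
        word = word.lower()
        if len(word) > 2 and word not in dict_file:
            found.add(word)
    return sorted(found)

out = io.StringIO()  # stand-in for the real CSV file
writer = csv.writer(out)
for name in sorted(reports):
    # One row per file, written as soon as that file is processed.
    writer.writerow(unique_words(io.StringIO(reports[name])))

csv_text = out.getvalue()
```

With real files, the StringIO objects would be replaced by open file handles; how the CSV file itself should be opened depends on the Python version in use.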

Answer 2

Try replacing report_list with a dictionary or a set. The test word_check not in report_list is slow if report_list is a list.
