

Parsing unique words from a text file

I'm working on a project to parse unique words out of a large number of text files. I've got the file handling down, but I'm trying to refine the parsing procedure. Each file has a specific text segment that ends with certain phrases, which I catch with a regex on my live system.

The parser should walk through each line and check each word against three criteria:

  1. It is longer than two characters.
  2. It is not in the predefined dictionary set dict_file.
  3. It is not already in the word list.

The result should be a 2D array, with each row holding the list of unique words for one file; each row is written to a CSV file with the .writerow(foo) method after its file is processed.
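That per-file write pattern can be sketched as follows. This is a minimal sketch, not the question's actual code: write_rows and parse_unique_words are hypothetical names, and on Python 2 the output file would be opened with mode "wb" and no newline argument.

```python
import csv

def write_rows(file_paths, out_path, parse_unique_words):
    # parse_unique_words stands in for whatever per-file parsing routine is used.
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        for path in file_paths:
            words = parse_unique_words(path)  # list of unique words for this file
            writer.writerow(words)            # one CSV row per processed file
```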

My working code is below, but it's slow and kludgy. What am I missing?

My production system is running Python 2.5.1 with only the default modules (so NLTK is a no-go), and it can't be upgraded to 2.7+.

import string
punct = string.maketrans('', '')  # identity table required by Python 2's str.translate()

def process(line):
    line_strip = line.strip()
    return line_strip.translate(punct, string.punctuation)

# Directory walking and initialization here
report_set = set()
with open(fullpath, 'r') as report:
    for line in report:
        # Strip out the CR/LF and punctuation from the input line
        line_check = process(line)
        if line_check == "FOOTNOTES":
            break
        for word in line_check.split():
            word_check = word.lower()
            if ((word_check not in report_set) and (word_check not in dict_file) 
                 and (len(word) > 2)):
                report_set.add(word_check)  # sets use .add(), not .append()
report_list = list(report_set)

Edit: Updated my code based on steveha's recommendations.

One problem is that an in test on a list is slow. You should probably keep a set to track which words you have seen, because the in test for a set is very fast.

Example:

report_set = set()
for line in report:
    for word in line.split():
        if we_want_to_keep_word(word):
            report_set.add(word)

Then when you are done: report_list = list(report_set)

Any time you need to force a set into a list, you can. But if you just need to loop over it or do in tests, you can leave it as a set; it's legal to do for x in report_set:.
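For instance, a small sketch of moving between the two types (the contents here are made up; set([...]) is used instead of a set literal because the question targets Python 2.5):

```python
report_set = set(["widget", "gadget", "sprocket"])

# Force the set into a list whenever a list is actually required:
report_list = list(report_set)
assert len(report_list) == 3

# Iteration and membership tests work directly on the set:
assert "widget" in report_set

# sorted() is convenient when the output order should be deterministic:
assert sorted(report_set) == ["gadget", "sprocket", "widget"]
```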

Another problem, which might or might not matter, is that you are slurping all the lines from the file in one go using the .readlines() method. For really large files it is better to use the open file-handle object as an iterator, like so:

with open("filename", "r") as f:
    for line in f:
        ... # process each line here

A big problem is that I don't even see how this code can work:

while 1:
    lines = report.readlines()
    if not lines:
        break

This will loop forever. The first statement slurps all the input lines with .readlines(), then we loop again, and the next call to .readlines() finds report already exhausted, so it returns an empty list and we break out of the infinite loop. But by then we have lost all the lines we just read, and the rest of the code must make do with an empty lines variable. How does this even work?
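The exhaustion is easy to demonstrate with an in-memory file. Here io.StringIO stands in for the real report handle (on Python 2.5 it would be StringIO.StringIO), and the contents are made up:

```python
import io

report = io.StringIO("line one\nline two\n")

first = report.readlines()   # slurps the whole file
second = report.readlines()  # handle is now exhausted, so this is empty

assert first == ["line one\n", "line two\n"]
assert second == []          # this empty list is what triggers the break
```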

So, get rid of that whole while 1 loop, and change the next loop to for line in report:.

Also, you don't really need to keep a count variable. You can use len(report_set) at any time to find out how many words are in the set.

Also, with a set you don't actually need to check whether a word is in the set; you can just always call report_set.add(word), and if it's already in the set it won't be added again.
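A minimal illustration of both points, with made-up words: len() replaces a separate counter, and add() silently ignores duplicates.

```python
report_set = set()
for word in ["table", "chair", "table"]:
    report_set.add(word)      # no membership check needed; duplicates are ignored

assert len(report_set) == 2   # len() gives the count at any time
assert report_set == set(["table", "chair"])
```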

Also, you don't have to do it my way, but I like to make a generator that does all the processing: strip the line, translate it, split on whitespace, and yield up words ready to use. I would also force the words to lower-case, except that I don't know whether it's important that FOOTNOTES be detected only in upper-case.

So, put all the above together and you get:

def words(file_object):
    for line in file_object:
        # On Python 2.5, pass string.maketrans('', '') instead of None
        line = line.strip().translate(None, string.punctuation)
        for word in line.split():
            yield word

report_set = set()
with open(fullpath, 'r') as report:
    for word in words(report):
        if word == "FOOTNOTES":
            break
        word = word.lower()
        if len(word) > 2 and word not in dict_file:
            report_set.add(word)

print("Words in report_set: %d" % len(report_set))

Try replacing report_list with a dictionary or set. word_check not in report_list is slow if report_list is a list.
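A rough sketch of why this matters; the sizes and words here are illustrative, not from the question:

```python
import timeit

words = [str(n) for n in range(10000)]
as_list = list(words)
as_set = set(words)

# A list membership test scans the elements one by one (O(n) worst case);
# a set membership test is a hash lookup (O(1) on average).
list_time = timeit.timeit(lambda: "9999" in as_list, number=1000)
set_time = timeit.timeit(lambda: "9999" in as_set, number=1000)
assert set_time < list_time
```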
