Reading a very large file word by word in Python

I have some pretty large text files (>2 GB) that I would like to process word by word. The files are space-delimited text files with no line breaks (all the words are on a single line). I want to take each word, test whether it is a dictionary word (using enchant), and if so, write it to a new file.

This is my code right now:

import enchant

d = enchant.Dict("en_US")  # the question implies an enchant dictionary; "en_US" is assumed

with open('big_file_of_words', 'r') as in_file, open('output_file', 'w') as out_file:
    words = in_file.read().split(' ')  # reads the whole >2 GB file into memory at once
    for word in words:                 # the original had "for word in word", a typo
        if d.check(word):
            out_file.write("%s " % word)

I looked at lazy method for reading big file in python, which suggests using yield to read in chunks, but I am concerned that chunks of a predetermined size will split words in the middle. Basically, I want chunks to be as close as possible to the specified size while splitting only on spaces. Any suggestions?

Combine the last word of one chunk with the first of the next:

def read_words(filename, chunk_size=10240):
    last = ""
    with open(filename) as inp:
        while True:
            buf = inp.read(chunk_size)
            if not buf:
                break
            # Split on single spaces: if a chunk ends exactly on a space,
            # the trailing empty string preserves the boundary, so the carried
            # word is not glued onto the first word of the next chunk.
            words = (last + buf).split(' ')
            last = words.pop()  # possibly-partial last word, carried forward
            for word in words:
                if word:  # skip empties produced by consecutive spaces
                    yield word
        if last:
            yield last

with open('output.txt', 'w') as output:  # the original omitted the 'w' mode
    for word in read_words('input.txt'):
        if d.check(word):  # d: the enchant dictionary from the question
            output.write("%s " % word)
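A quick sanity check for the boundary handling is to run the generator with a deliberately tiny buffer over a small sample file, so that most words straddle chunk boundaries. This is a throwaway sketch: the chunk_size parameter was exposed above for exactly this kind of experiment, and the temporary file is illustrative.

import os
import tempfile

# Write a small space-delimited sample file, mirroring the question's format.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write("alpha beta gamma delta epsilon")
    path = tmp.name

# A 4-character buffer forces nearly every word to span a chunk boundary.
assert list(read_words(path, chunk_size=4)) == [
    "alpha", "beta", "gamma", "delta", "epsilon"
]
os.remove(path)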

You might be able to get away with something similar to an answer on the question you've linked to, combining re and mmap, e.g.:

import mmap
import re

with open('big_file_of_words', 'rb') as in_file, open('output_file', 'w') as out_file:
    mf = mmap.mmap(in_file.fileno(), 0, access=mmap.ACCESS_READ)
    # Patterns matched against an mmap must be bytes patterns.
    for match in re.finditer(rb'\w+', mf):
        word = match.group().decode()
        # do something with word, e.g. the enchant check from the question
        if d.check(word):
            out_file.write("%s " % word)
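The appeal of mmap here is that the operating system pages the file in on demand, so the regex scan never needs the whole multi-gigabyte file resident in memory at once. One caveat: \w+ splits on punctuation as well as spaces, which may or may not match the question's space-delimited format.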

Fortunately, Petr Viktorin has already written code for us. The following code reads a chunk from a file, then yields each contained word; a word that spans two chunks is handled as well.

def words_from(input_file):
    # Function wrapper added so the snippet runs as-is; the linked answer
    # shows only the loop body, and words_from is an illustrative name.
    line = ''
    while True:
        word, space, line = line.partition(' ')
        if space:
            # A word was found
            yield word
        else:
            # A word was not found; read a chunk of data from the file
            next_chunk = input_file.read(1000)
            if next_chunk:
                # Add the chunk to our line
                line = word + next_chunk
            else:
                # No more data; yield the last word and return
                yield word.rstrip('\n')
                return
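
Plugging this generator into the question's task might look like the following sketch; the words_from name above and the en_US dictionary here are illustrative assumptions, not part of the linked answer.

import enchant

d = enchant.Dict("en_US")  # assumed dictionary language, as in the question

with open('big_file_of_words') as in_file, open('output_file', 'w') as out_file:
    for word in words_from(in_file):
        if word and d.check(word):  # guard: consecutive spaces can yield ''
            out_file.write("%s " % word)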

https://stackoverflow.com/a/7745406/143880
