Reading a very large file word by word in Python

I have some pretty large text files (>2 GB) that I would like to process word by word. The files are space-delimited text files with no line breaks (all the words are on a single line). I want to take each word, test whether it is a dictionary word (using enchant), and if so, write it to a new file.

This is my code right now:

import enchant

d = enchant.Dict("en_US")  # the question implies an enchant dictionary; "en_US" is assumed

with open('big_file_of_words', 'r') as in_file, open('output_file', 'w') as out_file:
    words = in_file.read().split(' ')  # reads the whole >2 GB file into memory at once
    for word in words:                 # the original had "for word in word", a typo
        if d.check(word):
            out_file.write("%s " % word)

I looked at lazy method for reading big file in python, which suggests using yield to read in chunks, but I am concerned that chunks of a predetermined size will split words in the middle. Basically, I want chunks to be as close as possible to the specified size while splitting only on spaces. Any suggestions?

Combine the last word of one chunk with the first of the next:

def read_words(filename, chunk_size=10240):
    last = ""
    with open(filename) as inp:
        while True:
            buf = inp.read(chunk_size)
            if not buf:
                break
            # Split on single spaces: if a chunk ends exactly on a space,
            # the trailing empty string preserves the boundary, so the carried
            # word is not glued onto the first word of the next chunk.
            words = (last + buf).split(' ')
            last = words.pop()  # possibly-partial last word, carried forward
            for word in words:
                if word:  # skip empties produced by consecutive spaces
                    yield word
        if last:
            yield last

with open('output.txt', 'w') as output:  # the original omitted the 'w' mode
    for word in read_words('input.txt'):
        if d.check(word):  # d: the enchant dictionary from the question
            output.write("%s " % word)
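A quick sanity check for the boundary handling is to run the generator with a deliberately tiny buffer over a small sample file, so that most words straddle chunk boundaries. This is a throwaway sketch: the chunk_size parameter was exposed above for exactly this kind of experiment, and the temporary file is illustrative.

import os
import tempfile

# Write a small space-delimited sample file, mirroring the question's format.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write("alpha beta gamma delta epsilon")
    path = tmp.name

# A 4-character buffer forces nearly every word to span a chunk boundary.
assert list(read_words(path, chunk_size=4)) == [
    "alpha", "beta", "gamma", "delta", "epsilon"
]
os.remove(path)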

You might be able to get away with something similar to an answer on the question you've linked to, combining re and mmap, e.g.:

import mmap
import re

with open('big_file_of_words', 'rb') as in_file, open('output_file', 'w') as out_file:
    mf = mmap.mmap(in_file.fileno(), 0, access=mmap.ACCESS_READ)
    # Patterns matched against an mmap must be bytes patterns.
    for match in re.finditer(rb'\w+', mf):
        word = match.group().decode()
        # do something with word, e.g. the enchant check from the question
        if d.check(word):
            out_file.write("%s " % word)
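The appeal of mmap here is that the operating system pages the file in on demand, so the regex scan never needs the whole multi-gigabyte file resident in memory at once. One caveat: \w+ splits on punctuation as well as spaces, which may or may not match the question's space-delimited format.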

Fortunately, Petr Viktorin has already written code for us. The following code reads a chunk from a file, then yields each contained word; a word that spans two chunks is handled as well.

def words_from(input_file):
    # Function wrapper added so the snippet runs as-is; the linked answer
    # shows only the loop body, and words_from is an illustrative name.
    line = ''
    while True:
        word, space, line = line.partition(' ')
        if space:
            # A word was found
            yield word
        else:
            # A word was not found; read a chunk of data from the file
            next_chunk = input_file.read(1000)
            if next_chunk:
                # Add the chunk to our line
                line = word + next_chunk
            else:
                # No more data; yield the last word and return
                yield word.rstrip('\n')
                return
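
Plugging this generator into the question's task might look like the following sketch; the words_from name above and the en_US dictionary here are illustrative assumptions, not part of the linked answer.

import enchant

d = enchant.Dict("en_US")  # assumed dictionary language, as in the question

with open('big_file_of_words') as in_file, open('output_file', 'w') as out_file:
    for word in words_from(in_file):
        if word and d.check(word):  # guard: consecutive spaces can yield ''
            out_file.write("%s " % word)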

https://stackoverflow.com/a/7745406/143880
