Reading a very large file word by word in Python

Question

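I have some very large text files (> 2 GB) that I want to process word by word. The files are space-delimited text with no line breaks (all the words are on a single line). I want to take each word, test whether it is a dictionary word (using enchant), and if so, write it to a new file.

This is my code right now: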

with open('big_file_of_words', 'r') as in_file:
    with open('output_file', 'w') as out_file:
        # d is assumed to be an enchant dictionary object
        # read() loads the entire file into memory at once
        words = in_file.read().split(' ')
        for word in words:
            if d.check(word):
                out_file.write("%s " % word)

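I looked at "Lazy method for reading big file in Python", which suggests using yield to read in chunks, but I am concerned that chunks of a predetermined size will split words in the middle. Basically, I want each chunk to be as close to the specified size as possible while splitting only on spaces. Any suggestions?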

Answer 1

Combine the last word of one chunk with the first word of the next:

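def read_words(filename, chunk_size=10240):
    last = ""  # possibly incomplete word carried over from the previous chunk
    with open(filename) as inp:
        while True:
            buf = inp.read(chunk_size)
            if not buf:
                break
            words = (last + buf).split()
            # Hold the final word back only if the chunk may have cut it in half;
            # a chunk ending in whitespace ends on a complete word.
            last = "" if buf[-1].isspace() else words.pop()
            for word in words:
                yield word
    if last:
        yield last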

with open('output.txt', 'w') as output:
    for word in read_words('input.txt'):
        # check() stands for the dictionary test, e.g. d.check from the question
        if check(word):
            output.write("%s " % word)
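
To see that the carry-over keeps words intact across chunk boundaries, the same idea can be run on an in-memory stream with a deliberately tiny chunk size. This is only a sketch: iter_words and chunk_size are names chosen here for illustration, and the guard for chunks that end in whitespace is an addition, not part of the answer as posted.

```python
import io

def iter_words(stream, chunk_size=10240):
    """Yield whitespace-separated words from a text stream, one chunk at a time."""
    carry = ""  # possibly incomplete word left over from the previous chunk
    while True:
        buf = stream.read(chunk_size)
        if not buf:
            break
        words = (carry + buf).split()
        # If the chunk ends in whitespace, its last word is complete;
        # otherwise hold it back until the next chunk arrives.
        carry = "" if buf[-1].isspace() else words.pop()
        yield from words
    if carry:
        yield carry

text = "alpha beta  gamma delta epsilon "
# A chunk size of 4 forces most words to straddle a chunk boundary.
result = list(iter_words(io.StringIO(text), chunk_size=4))
print(result)
```

The result should match text.split() for any chunk size, which makes a convenient sanity check before pointing the generator at a multi-gigabyte file.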

Answer 2

You could probably use an approach similar to the answers to the question you linked, but combining re and mmap, for example:

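import mmap
import re

with open('big_file_of_words', 'rb') as in_file, open('output_file', 'w') as out_file:
    # Map the whole file read-only; the OS pages it in on demand
    mf = mmap.mmap(in_file.fileno(), 0, access=mmap.ACCESS_READ)
    # An mmap exposes bytes, so the pattern must be a bytes pattern
    # (with a bytes pattern, \w matches ASCII word characters only)
    for match in re.finditer(rb'\w+', mf):
        word = match.group().decode('ascii')
        # do something with word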
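
As a rough, self-contained illustration of the mmap approach, the sketch below builds a small temporary file in place of the real input and collects the matches. The sample contents and the name found are made up for the example, and it uses rb"\S+" instead of \w+ so that words are delimited by whitespace the way str.split() would delimit them; that is a substitution, not what the answer wrote.

```python
import mmap
import os
import re
import tempfile

# Build a small sample file standing in for the real multi-gigabyte input.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"alpha beta gamma delta")
    path = tmp.name

found = []
with open(path, "rb") as in_file:
    # Length 0 maps the whole file; pages are loaded lazily by the OS.
    with mmap.mmap(in_file.fileno(), 0, access=mmap.ACCESS_READ) as mf:
        # mmap objects expose bytes, so a bytes pattern is required.
        for match in re.finditer(rb"\S+", mf):
            found.append(match.group().decode("utf-8"))

os.remove(path)
print(found)
```

Because the regex scans the mapping directly, no chunk boundaries exist to split a word, at the cost of working with bytes rather than str.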

Answer 3

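Fortunately, Petr Viktorin has already written the code for us. The following code reads a chunk from the file and then yields each word it contains. If a word spans several chunks, that is handled too.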

# Wrapped in a generator function so that yield is valid;
# input_file is an open text file object.
def read_words(input_file, chunk_size=1000):
    line = ''
    while True:
        word, space, line = line.partition(' ')
        if space:
            # A word was found
            yield word
        else:
            # A word was not found; read a chunk of data from file
            next_chunk = input_file.read(chunk_size)
            if next_chunk:
                # Add the chunk to our line
                line = word + next_chunk
            else:
                # No more data; yield the last word and return
                yield word.rstrip('\n')
                return

https://stackoverflow.com/a/7745406/143880
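
The partition-based generator can be wired into the filter-and-write loop from the question roughly as follows. This is a sketch under assumptions: words_from is a hypothetical wrapper name, known_words is a stand-in set used in place of a real dictionary check such as PyEnchant's Dict.check(), and in-memory streams replace the real files.

```python
import io

def words_from(input_file, chunk_size=1000):
    """Partition-based word generator, wrapped in a function (hypothetical name)."""
    line = ''
    while True:
        word, space, line = line.partition(' ')
        if space:
            # A complete word was found before a space.
            yield word
        else:
            # No space yet: the word may continue in the next chunk.
            next_chunk = input_file.read(chunk_size)
            if next_chunk:
                line = word + next_chunk
            else:
                # End of input: emit whatever is left.
                yield word.rstrip('\n')
                return

# Stand-in for a dictionary check such as PyEnchant's Dict.check().
known_words = {"the", "quick", "fox"}

source = io.StringIO("the qwzx quick brrp fox\n")
sink = io.StringIO()
# chunk_size=3 is smaller than most words, so words span several chunks.
for word in words_from(source, chunk_size=3):
    if word in known_words:
        sink.write("%s " % word)

output = sink.getvalue()
print(output)
```

Note that runs of consecutive spaces make this generator yield empty strings; a membership test like the one above simply rejects them, but a real dictionary check may need an explicit guard.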


Answer 1
5 2014-08-18 21:36:23

Answer 2
1 2014-08-18 21:35:33

Answer 3
0 2014-08-18 21:35:51
