
Read Large Text file in Python

I want to read each line from a text file in Python (around 1 billion lines), take some words from each line, and insert them into another file. I have used:

with open('') as f:
    for line in f:
        process_line(line)

This process is taking a lot of time. How can I process the file so that all the contents are read in about 2 hours?

The performance bottleneck of your script likely comes from the fact that it writes to 3 files at the same time, causing massive fragmentation between the files and hence lots of overhead.

So instead of writing to 3 files at the same time as you read through the lines, you can buffer up a million lines (which should take less than 1 GB of memory) before writing the 3 million words to the output files one file at a time, so that it produces much less file fragmentation:

def write_words(words, *files):
    for i, file in enumerate(files):
        for word in words:
            file.write(word[i] + '\n')

words = []
with open('input.txt', 'r') as f, open('words1.txt', 'w') as out1, open('words2.txt', 'w') as out2, open('words3.txt', 'w') as out3:
    for count, line in enumerate(f, 1):
        words.append(line.rstrip().split(','))
        if count % 1000000 == 0:
            write_words(words, out1, out2, out3)
            words = []
    write_words(words, out1, out2, out3)

Read about generators in Python. Your code should look like this:

def read_file(yours_file):
    # Lazily yield one line at a time instead of
    # loading the whole file into memory
    while True:
        data = yours_file.readline()
        if not data:
            break
        yield data
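A minimal, self-contained sketch of how this generator would be consumed, repeating the function for completeness and using an in-memory file for illustration (a real run would pass a file opened with open() instead):

```python
import io

def read_file(yours_file):
    # Lazily yield one line at a time
    while True:
        data = yours_file.readline()
        if not data:
            break
        yield data

# Demo on an in-memory file; only one line is held in memory at a time
sample = io.StringIO('first\nsecond\n')
lines = list(read_file(sample))
# lines == ['first\n', 'second\n']
```

Note that plain file objects already support lazy line iteration (for line in f), so this generator mainly makes the iteration explicit rather than faster.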
