python sampling using readline gives memory error

I tried to sample a data file with over 260 million lines, creating an evenly distributed sample with a fixed size of 1000 samples.

What I did was the following:

import random

file = "input.txt"
output = open("output.txt", "w+", encoding = "utf-8")

samples = random.sample(range(1, 264000000), 1000)
samples.sort(reverse=False)

with open(file, encoding = "utf-8") as fp:
    line = fp.readline()
    count = 0
    while line:
        if count in samples:
            output.write(line)
            samples.remove(count)
        count += 1
        line = fp.readline()

This code resulted in a memory error, with no further description. How can this code give a memory error?

As far as I know it should read my file line by line. The file is 28.4GB, so it can't be read as a whole, which is why I resorted to the readline() approach. How could I fix this, so that the whole file can be processed, regardless of its size?

EDIT: The latest attempts throw this error, which is practically identical to each of the prior error messages I have gotten so far:

MemoryError                               Traceback (most recent call last)
<ipython-input-1-a772dad1ea5a> in <module>()
     12 with open(file, encoding = "utf-8") as fp:
     13     count = 0
---> 14     for line in fp:
     15         if count in samples:
     16             output.write(line)

~\Anaconda3\lib\codecs.py in decode(self, input, final)
    320         # decode input (taking the buffer into account)
    321         data = self.buffer + input
--> 322         (result, consumed) = self._buffer_decode(data, self.errors, final)
    323         # keep undecoded input until the next call
    324         self.buffer = data[consumed:]

MemoryError: 

So it looks like this line causes a huge memory spike:

samples = random.sample(range(1, 264000000), 1000)

My guess is that this call forces Python to create all 264M ints in that range before it can do the sampling. Try this code instead to sample the same range without replacement:

from random import randint

file = "input.txt"
output = open("output.txt", "w+", encoding = "utf-8")

# Draw 1000 distinct line numbers; the set discards duplicates,
# which makes this sampling without replacement.
samples = set()
while len(samples) < 1000:
    samples.add(randint(0, 264000000))

with open(file, encoding = "utf-8") as fp:
    count = 0
    for line in fp:
        if count in samples:
            output.write(line)
            samples.remove(count)
        count += 1

        if not samples: break
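For what it's worth, the same scan can also be written with enumerate instead of a manual counter; a sketch under the same assumptions as the snippet above (same samples set and output handle):

# Equivalent variant: enumerate yields (line number, line) pairs,
# replacing the manual count variable.
with open(file, encoding="utf-8") as fp:
    for count, line in enumerate(fp):
        if count in samples:
            output.write(line)
            samples.remove(count)
        if not samples:  # every sampled line has been written
            break
output.close()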

SOLVED

I finally solved the problem: all the code here works properly. The range issue is indeed only present in Python versions prior to 3.0, where it should be xrange(1, 264000000).
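For reference, a minimal sketch of the Python 2 form; on Python 3, range is already a lazy sequence, so the original call is fine as-is:

import random

# Python 2 only: xrange is lazy, so random.sample can pick 1000 line
# numbers without materializing all 264M integers in memory.
samples = random.sample(xrange(1, 264000000), 1000)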

The input file was constructed in a different code file, where it was written as follows:

with open(file, encoding = "utf-8", errors = 'ignore') as fp:
    line = fp.readline()
    while line:
        input_line = line.split(sep="\t")
        output.write(input_line[1] + "," + input_line[2])
        line = fp.readline()

The problem here is that this code does not construct a file with lines, but just keeps appending everything to the first line, since the write call never adds a newline character after each record. Therefore, the whole file was read as one big line, instead of as a file with many lines to iterate over.
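A minimal sketch of a corrected writer, assuming tab-separated input with at least three fields per line; the explicit "\n" is what restores the line structure:

with open(file, encoding="utf-8", errors="ignore") as fp:
    for line in fp:
        fields = line.rstrip("\n").split(sep="\t")
        # Write an explicit newline so the output contains one record per line.
        output.write(fields[1] + "," + fields[2] + "\n")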

Thanks a lot for your help and my sincere apologies for the fact that the problem was situated elsewhere in my project.
