
Python sampling using readline gives memory error

I tried to sample a data file with over 260 million lines, creating an evenly distributed sample with a fixed size of 1000 samples.

What I did was the following:

import random

file = "input.txt"
output = open("output.txt", "w+", encoding="utf-8")

# pick 1000 distinct line numbers and sort them in ascending order
samples = random.sample(range(1, 264000000), 1000)
samples.sort()

with open(file, encoding="utf-8") as fp:
    line = fp.readline()
    count = 0
    while line:
        if count in samples:
            output.write(line)
            samples.remove(count)
        count += 1
        line = fp.readline()

This code resulted in a MemoryError with no further description. How can this code give a memory error?

As far as I know, it should read the file line by line. The file is 28.4 GB, so it cannot be read as a whole, which is why I resorted to the readline() approach. How can I fix this so that the whole file can be processed, regardless of its size?

EDIT: The latest attempts throw this error, which is practically identical to all of the prior error messages I have gotten so far:

MemoryError                               Traceback (most recent call last)
<ipython-input-1-a772dad1ea5a> in <module>()
     12 with open(file, encoding = "utf-8") as fp:
     13     count = 0
---> 14     for line in fp:
     15         if count in samples:
     16             output.write(line)

~\Anaconda3\lib\codecs.py in decode(self, input, final)
    320         # decode input (taking the buffer into account)
    321         data = self.buffer + input
--> 322         (result, consumed) = self._buffer_decode(data, self.errors, final)
    323         # keep undecoded input until the next call
    324         self.buffer = data[consumed:]

MemoryError: 

So it looks like this line causes a huge memory spike:

samples = random.sample(range(1, 264000000), 1000)

My guess is that this call forces Python to create all 264 million ints in that range before it can do the sampling. Try this code instead for sampling in the same range without replacement:

from random import randint

file = "input.txt"
output = open("output.txt", "w+", encoding="utf-8")

# build a set of 1000 distinct random line numbers
samples = set()
while len(samples) < 1000:
    samples.add(randint(0, 264000000))

with open(file, encoding="utf-8") as fp:
    count = 0
    for line in fp:
        if count in samples:
            output.write(line)
            samples.remove(count)
        count += 1

        # stop reading once every sampled line has been written
        if not samples:
            break
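
As a side note, the same reading loop can be written a bit more compactly with enumerate() instead of a manual counter; a sketch with identical behavior:

# same logic as above, with enumerate() providing the line counter
with open(file, encoding="utf-8") as fp:
    for count, line in enumerate(fp):
        if count in samples:
            output.write(line)
            samples.remove(count)
            if not samples:
                break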

SOLVED

I finally solved the problem: all of the code here works properly. The range issue is indeed only present in Python versions prior to 3.0, where it should be xrange(1, 264000000).
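
For reference, a minimal sketch of that version difference (assuming the same 264,000,000-line range as above): in Python 3, range() is a lazy sequence, so sampling from it does not build a 264-million-element list, while in Python 2 range() does, and xrange() is the lazy equivalent.

import random
import sys

if sys.version_info[0] >= 3:
    # Python 3: range() is lazy, so this does not materialize 264 million ints
    samples = random.sample(range(1, 264000000), 1000)
else:
    # Python 2: range() builds a full list; xrange() is the lazy equivalent
    samples = random.sample(xrange(1, 264000000), 1000)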

The input file was constructed in a different code file, where it was written as follows:

with open(file, encoding = "utf-8", errors = 'ignore') as fp:  
line = fp.readline()
    while line:
        input_line = line.split(sep="\t")
        output.write(input_line[1] + "," + input_line[2])
        line = fp.readline()

The problem here is that this code does not construct a file with separate lines: since no newline character is ever written, it just keeps appending to the first line. Therefore, the whole file was read as one huge line instead of as a file with many lines to iterate over.
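
For completeness, a minimal sketch of the corrected construction code (the same logic, assuming the tab-separated input has at least three fields, with an explicit newline added after each record):

with open(file, encoding="utf-8", errors='ignore') as fp:
    for line in fp:
        fields = line.rstrip("\n").split(sep="\t")
        # end each record with a newline so the output file can be read line by line
        output.write(fields[1] + "," + fields[2] + "\n")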

Thanks a lot for your help, and my sincere apologies that the problem was situated elsewhere in my project.
