Sampling in Python with readline causes a MemoryError

Question

I am trying to sample a data file of about 260 million lines, drawing a uniformly distributed sample with a fixed size of 1000 lines.

This is what I did:

import random

file = "input.txt"
output = open("output.txt", "w+", encoding = "utf-8")

samples = random.sample(range(1, 264000000), 1000)
samples.sort(reverse=False)

with open(file, encoding = "utf-8") as fp:
    line = fp.readline()
    count = 0
    while line:
        if count in samples:
            output.write(line)
            samples.remove(count)
        count += 1
        line = fp.readline()

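This code raises a MemoryError with no further description. How can this code cause a memory error?

As far as I know, it should read my file line by line. The file is 28.4 GB, so it cannot be read as a whole, which is why I resorted to the readline() method. How can I fix this so that the whole file can be processed regardless of its size?

Edit: My latest attempt raises the error below, which is practically identical to every previous error message I have received so far: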

MemoryError                               Traceback (most recent call last)
<ipython-input-1-a772dad1ea5a> in <module>()
     12 with open(file, encoding = "utf-8") as fp:
     13     count = 0
---> 14     for line in fp:
     15         if count in samples:
     16             output.write(line)

~\Anaconda3\lib\codecs.py in decode(self, input, final)
    320         # decode input (taking the buffer into account)
    321         data = self.buffer + input
--> 322         (result, consumed) = self._buffer_decode(data, self.errors, final)
    323         # keep undecoded input until the next call
    324         self.buffer = data[consumed:]

MemoryError:

Answer 1

So, it looks like this line causes a huge memory spike:

samples = random.sample(range(1, 264000000), 1000)

My guess is that this call forces Python to create all 264M ints in that range before sampling. Try the following code to sample from the same range without replacement:

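from random import randrange

file = "input.txt"

# Draw 1000 distinct line numbers from the same range as range(1, 264000000).
# Adding to a set silently discards duplicates, so no membership check is needed.
samples = set()
while len(samples) < 1000:
    samples.add(randrange(1, 264000000))

# Open both files in one with-statement so the output is flushed and closed.
with open(file, encoding = "utf-8") as fp, open("output.txt", "w", encoding = "utf-8") as output:
    for count, line in enumerate(fp):
        if count in samples:
            output.write(line)
            samples.remove(count)
            # Stop reading as soon as every sampled line has been written.
            if not samples:
                break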

Answer 2

Solved

I finally solved the problem: all of the code here works fine. The range issue really only exists in versions before 3.0, where it should be xrange(1, 264000000).
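To illustrate the point about range, here is a minimal sketch, assuming Python 3 on CPython. The names N_LINES, line_range and range_size_bytes are made up for the example. It shows that a range object stays small no matter how many numbers it covers, and that random.sample can draw from it directly without building a list first:

```python
import random
import sys

N_LINES = 264_000_000  # approximate line count, taken from the question

# In Python 3, range() is a lazy sequence: it stores only start, stop and step,
# so its memory footprint does not depend on how many numbers it covers.
line_range = range(1, N_LINES)
range_size_bytes = sys.getsizeof(line_range)

# random.sample() picks positions inside the range instead of materializing it.
samples = random.sample(line_range, 1000)
samples.sort()

print("range object size in bytes:", range_size_bytes)
print("number of samples drawn:", len(samples))
```

In Python 2 the same call would build the full list first, which is why xrange is the right choice there.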

The input file was built by another script, which was written as follows:

with open(file, encoding = "utf-8", errors = 'ignore') as fp:
    line = fp.readline()
    while line:
        input_line = line.split(sep="\t")
        # Note: no newline is written after each record.
        output.write(input_line[1] + "," + input_line[2])
        line = fp.readline()

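The problem here is that this code does not build the file out of lines; it just keeps appending everything to the first line. As a result, the whole file is read as one huge line rather than as a file with many lines to iterate over.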
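A possible fix can be sketched as follows. The helper names convert and count_newlines, the file names and the sample rows are all hypothetical, and the sketch assumes every usable row has at least three tab-separated fields. The writer ends each record with an explicit newline, and the checker counts newlines in fixed-size binary chunks, so it stays cheap even if the file turns out to be one giant line:

```python
import os
import tempfile


def convert(src_path, dst_path):
    """Write fields 1 and 2 of each tab-separated row as one "a,b" record per line."""
    with open(src_path, encoding="utf-8", errors="ignore") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3:
                continue  # assumption: rows with too few fields are skipped
            # The explicit "\n" is what keeps the records on separate lines.
            dst.write(fields[1] + "," + fields[2] + "\n")


def count_newlines(path, chunk_size=1 << 20):
    """Count newline bytes by reading fixed-size chunks instead of whole lines."""
    total = 0
    with open(path, "rb") as fp:
        while True:
            chunk = fp.read(chunk_size)
            if not chunk:
                break
            total += chunk.count(b"\n")
    return total


# Small demonstration on a temporary two-row file.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "raw.tsv")
    dst = os.path.join(tmp, "input.txt")
    with open(src, "w", encoding="utf-8") as fp:
        fp.write("1\ta\tb\tx\n2\tc\td\ty\n")
    convert(src, dst)
    newline_count = count_newlines(dst)
    with open(dst, encoding="utf-8") as fp:
        converted = fp.read()

print("newlines in converted file:", newline_count)
```

Running count_newlines on a multi-gigabyte input before sampling is a quick way to confirm that it really contains as many lines as expected.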

Thank you very much for your help, and my sincere apologies that the problem was somewhere else in my project.
