
Python pickle file strangely large

I made a pickle file storing the grayscale value of each pixel in 100,000 80x80 images.

(Plus an array of 100,000 integers whose values are single digits.)

My approximation for the total size of the pickle is:

4 bytes x 80 x 80 x 100,000 = 2.56 GB

plus the array of integers, which shouldn't be that large.

The generated pickle file, however, is over 16 GB, so it takes hours just to unpickle and load it, and it eventually freezes after exhausting all available memory.

Is there something wrong with my calculation, or with the way I pickled it?

I pickled the file in the following way:

from PIL import Image
import pickle
import os
import numpy

# indir1: root directory of the input images (defined elsewhere)
trainpixels = numpy.empty([80000, 6400])
trainlabels = numpy.empty(80000)
validpixels = numpy.empty([10000, 6400])
validlabels = numpy.empty(10000)
testpixels = numpy.empty([10408, 6400])
testlabels = numpy.empty(10408)

i = 0
tr = 0
va = 0
te = 0
for (root, dirs, filenames) in os.walk(indir1):
    for f in filenames:
        try:
            im = Image.open(os.path.join(root, f))
            Imv = im.load()
            x, y = im.size
            pixelv = numpy.empty(6400)
            ind = 0
            for ii in range(x):
                for j in range(y):
                    pixelv[ind] = float(Imv[j, ii]) / 255.0
                    ind += 1
            if i < 40000:
                trainpixels[tr] = pixelv
                tr += 1
            elif i < 45000:
                validpixels[va] = pixelv
                va += 1
            else:
                testpixels[te] = pixelv
                te += 1
            print(str(i) + '\t' + str(f))
            i += 1
        except IOError:
            continue

trainimage = (trainpixels, trainlabels)
validimage = (validpixels, validlabels)
testimage = (testpixels, testlabels)

output = open('data.pkl', 'wb')
pickle.dump(trainimage, output)
pickle.dump(validimage, output)
pickle.dump(testimage, output)
output.close()
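One detail in the code above likely accounts for a large part of the gap (my observation, not from the original post): numpy.empty defaults to float64, so each pixel takes 8 bytes rather than the 4 assumed in the estimate, and on Python 2, pickle.dump without a protocol argument also used the text-based protocol 0, which inflates binary data further. A quick check of the dtype effect:

```python
import numpy

# numpy.empty (and numpy.zeros) default to float64: 8 bytes per value.
a64 = numpy.zeros((1000, 6400))
a32 = numpy.zeros((1000, 6400), dtype=numpy.float32)

print(a64.dtype, a64.nbytes)  # float64 51200000
print(a32.dtype, a32.nbytes)  # float32 25600000

# Scaled to 100,000 images, the 8-byte default alone turns the
# 4-byte estimate of 2.56 GB into 5.12 GB before any pickle overhead.
print(8 * 6400 * 100000)  # 5120000000
```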

Please let me know if you see something wrong with either my calculation or my code!

Python pickles are not a thrifty mechanism for storing data: you are storing objects instead of just the data.

The following test case takes 24 kB on my system, and that is for a small, sparsely populated numpy array stored in a pickle:

import os
import sys
import numpy
import pickle

testlabels = numpy.empty(1000)
testlabels[0] = 1
testlabels[99] = 0

test_labels_size = sys.getsizeof(testlabels)  # 80

output = open('/tmp/pickle', 'wb')
pickle.dump(testlabels, output)  # dump returns None; it writes to the file
output.close()

print(os.path.getsize('/tmp/pickle'))

Further, I'm not sure why you believe 4 bytes to be the size of a number in Python -- non-numpy ints are 24 bytes ( sys.getsizeof(1) ) and numpy arrays are a minimum of 80 bytes ( sys.getsizeof(numpy.array([0], float)) ).
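One caveat worth adding here (my note, not the original answerer's): sys.getsizeof reports the Python object, and on current numpy versions it also includes the data buffer for arrays that own their data; the raw payload alone is available as .nbytes. A small sketch:

```python
import sys

import numpy

a = numpy.zeros(1000)    # 1000 float64 values
print(a.nbytes)          # 8000 bytes of raw data
print(sys.getsizeof(a))  # at least the raw data, plus the array header
```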

As you stated in response to my comment, you have reasons for staying with pickle, so I won't try to convince you further not to store objects, but be aware of the overhead of doing so.

As an option: reduce the size of your training data, or pickle fewer objects.
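As a sketch of that option (my suggestion, assuming numpy's uncompressed .npz format is acceptable in place of a pickle): storing the pixels as float32 and the one-digit labels as uint8 keeps the file close to the raw data size:

```python
import os
import tempfile

import numpy

# Smaller dtypes: 4 bytes per pixel, 1 byte per one-digit label.
pixels = numpy.zeros((1000, 6400), dtype=numpy.float32)
labels = numpy.zeros(1000, dtype=numpy.uint8)

path = os.path.join(tempfile.mkdtemp(), 'data.npz')
numpy.savez(path, pixels=pixels, labels=labels)  # raw data plus small headers

with numpy.load(path) as data:
    print(data['pixels'].dtype, data['pixels'].shape)  # float32 (1000, 6400)
print(os.path.getsize(path))  # slightly over pixels.nbytes + labels.nbytes
```

If disk space matters more than load time, numpy.savez_compressed writes the same layout with zlib compression.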


 