Python object serialization: issues with pickle vs hickle
For a couple of days now, I have been stuck on my machine learning project. I have a Python script that should transform the data for model training by a second script. The first script produces a list of arrays that I would like to dump to disk; the second unpickles it.
I tried using pickle several times, but every time the script attempts pickling, I get a memory error:
Traceback (most recent call last):
  File "Prepare_Input.py", line 354, in <module>
    pickle.dump(Total_Velocity_Change, file)
MemoryError
And sometimes the script is forced to stop running with a Killed message.
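One pattern that can sidestep a single giant dump is to pickle the list one element at a time into the same file, so no single call has to serialize (or memoize) the whole structure at once, and the elements can be read back lazily. A minimal stdlib sketch of that idea (the variable and file names are placeholders, not the question's actual code):

```python
import pickle

def dump_incremental(items, path):
    """Pickle each element separately so no single dump must fit in memory."""
    with open(path, "wb") as f:
        for item in items:
            pickle.dump(item, f, protocol=pickle.HIGHEST_PROTOCOL)

def load_incremental(path):
    """Yield elements back in the order they were dumped."""
    with open(path, "rb") as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return

# Usage (hypothetical):
#   dump_incremental(Total_Velocity_Change, "total_velocity_change.pkl")
#   data = list(load_incremental("total_velocity_change.pkl"))
```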
I also tried using hickle; however, when left overnight, the script keeps running for a long time, with hickle dumping a huge file of nearly 10 GB (du -sh myfile.hkl). I am certain there is no way the array size can exceed 1.5 GB at most. I can also dump the array to the console (print). Using hickle, I had to kill the process to stop the script from running.
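As a sanity check on an estimate like the 1.5 GB above, the in-memory footprint of a list of NumPy arrays can be measured directly before dumping. A small sketch (the arrays here are stand-in data, not the question's actual variables); note the pickled file can be somewhat larger than this because of serialization overhead:

```python
import numpy as np

def total_nbytes(arrays):
    """Sum the raw buffer sizes of a list of NumPy arrays, in bytes."""
    return sum(a.nbytes for a in arrays)

arrays = [np.zeros((1000, 100)), np.zeros((500, 200))]  # stand-in data
print(f"{total_nbytes(arrays) / 1024**3:.3f} GiB")  # → 0.001 GiB
```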
I also tried all the answers here; unfortunately, none worked for me.
Does anyone have an idea how I can safely dump my file to disk for later loading?
Using dill, I get the following errors:
Traceback (most recent call last):
  File "Prepare_Input.py", line 356, in <module>
    dill.dump(Total_Velocity_Change, fp)
  File "/home/akil/Desktop/tmd/venv/lib/python3.7/site-packages/dill/_dill.py", line 259, in dump
    Pickler(file, protocol, **_kwds).dump(obj)
  File "/home/akil/Desktop/tmd/venv/lib/python3.7/site-packages/dill/_dill.py", line 445, in dump
    StockPickler.dump(self, obj)
  File "/home/akil/anaconda3/lib/python3.7/pickle.py", line 437, in dump
    self.save(obj)
  File "/home/akil/anaconda3/lib/python3.7/pickle.py", line 504, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/akil/anaconda3/lib/python3.7/pickle.py", line 819, in save_list
    self._batch_appends(obj)
  File "/home/akil/anaconda3/lib/python3.7/pickle.py", line 843, in _batch_appends
    save(x)
  File "/home/akil/anaconda3/lib/python3.7/pickle.py", line 504, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/akil/anaconda3/lib/python3.7/pickle.py", line 819, in save_list
    self._batch_appends(obj)
  File "/home/akil/anaconda3/lib/python3.7/pickle.py", line 843, in _batch_appends
    save(x)
  File "/home/akil/anaconda3/lib/python3.7/pickle.py", line 504, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/akil/anaconda3/lib/python3.7/pickle.py", line 819, in save_list
    self._batch_appends(obj)
  File "/home/akil/anaconda3/lib/python3.7/pickle.py", line 843, in _batch_appends
    save(x)
  File "/home/akil/anaconda3/lib/python3.7/pickle.py", line 549, in save
    self.save_reduce(obj=obj, *rv)
  File "/home/akil/anaconda3/lib/python3.7/pickle.py", line 638, in save_reduce
    save(args)
  File "/home/akil/anaconda3/lib/python3.7/pickle.py", line 504, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/akil/anaconda3/lib/python3.7/pickle.py", line 774, in save_tuple
    save(element)
  File "/home/akil/anaconda3/lib/python3.7/pickle.py", line 504, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/akil/anaconda3/lib/python3.7/pickle.py", line 735, in save_bytes
    self.memoize(obj)
  File "/home/akil/anaconda3/lib/python3.7/pickle.py", line 461, in memoize
    self.memo[id(obj)] = idx, obj
MemoryError
If you want to dump a huge list of arrays, you might want to look at dask or klepto. dask could break up the list into lists of sub-arrays, while klepto could break up the list into a dict of sub-arrays (with the key indicating the ordering of the sub-arrays).
>>> import klepto as kl
>>> import numpy as np
>>> big = np.random.randn(10,100) # could be a huge array
>>> ar = kl.archives.dir_archive('foo', dict(enumerate(big)), cached=False)
>>> list(ar.keys())
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>>
Then one entry per file is serialized to disk (in output.pkl):
$ ls foo/K_0/
input.pkl output.pkl
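Whichever backend writes the per-key files, the load side only has to read the keys back in order and reassemble the list. A stdlib sketch of that idea, using one pickle file per key in a flat directory (a simplified layout for illustration, not klepto's actual foo/K_*/ structure):

```python
import os
import pickle

def dump_keyed(items, dirname):
    """Write each list element to its own pickle file, named by its index."""
    os.makedirs(dirname, exist_ok=True)
    for key, item in enumerate(items):
        with open(os.path.join(dirname, f"{key}.pkl"), "wb") as f:
            pickle.dump(item, f)

def load_keyed(dirname):
    """Reassemble the list by reading the numbered files in key order."""
    keys = sorted(int(name.split(".")[0]) for name in os.listdir(dirname))
    out = []
    for key in keys:
        with open(os.path.join(dirname, f"{key}.pkl"), "rb") as f:
            out.append(pickle.load(f))
    return out
```

Sorting the integer keys (rather than the file names as strings) is what preserves the original ordering once there are more than ten entries.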