
How to save a large pandas dataframe with complex arrays and load it up again?

I have a large pandas DataFrame whose individual elements are complex numpy arrays. Below is a minimal code example that reproduces the scenario:


import numpy as np
import pandas as pd

# object-dtype DataFrame whose cells each hold a 2x2 complex array
d = {f'x{i}': [] for i in range(4)}
df = pd.DataFrame(data=d).astype(object)

for K in range(4):
    for i in range(4):
        df.loc[f'{K}', f'x{i}'] = np.random.random(size=(2,2)) + np.random.random(size=(2,2)) * 1j

df

What is the best way to save these and load them up again for use later?

The problem I am having is that when I increase the size of the stored matrices and the number of elements, I get an OverflowError when I try to save the DataFrame as an .h5 file, as shown below:

import numpy as np
import pandas as pd

size = (300, 300)
xs = 1500

d = {f'x{i}': [] for i in range(xs)}
df = pd.DataFrame(data=d).astype(object)

# 10 rows x 1500 columns, each cell a 300x300 complex array
for K in range(10):
    for i in range(xs):
        df.loc[f'{K}', f'x{i}'] = np.random.random(size=size) + np.random.random(size=size) * 1j

df.to_hdf('test.h5', key="df", mode="w")

load_test = pd.read_hdf("test.h5", "df")
---------------------------------------------------------------------------
OverflowError                             Traceback (most recent call last)
<ipython-input-124-8cb8df1a0653> in <module>
     12         df.loc[f'{K}', f'x{i}'] = np.random.random(size=size) + np.random.random(size=size) * 1j
     13 
---> 14 df.to_hdf('test.h5', key="df", mode="w")
     15 
     16 

~/PQKs/pqks/lib/python3.6/site-packages/pandas/core/generic.py in to_hdf(self, path_or_buf, key, mode, complevel, complib, append, format, index, min_itemsize, nan_rep, dropna, data_columns, errors, encoding)
   2447             data_columns=data_columns,
   2448             errors=errors,
-> 2449             encoding=encoding,
   2450         )
   2451 

~/PQKs/pqks/lib/python3.6/site-packages/pandas/io/pytables.py in to_hdf(path_or_buf, key, value, mode, complevel, complib, append, format, index, min_itemsize, nan_rep, dropna, data_columns, errors, encoding)
    268             path_or_buf, mode=mode, complevel=complevel, complib=complib
    269         ) as store:
--> 270             f(store)
    271     else:
    272         f(path_or_buf)

~/PQKs/pqks/lib/python3.6/site-packages/pandas/io/pytables.py in <lambda>(store)
    260             data_columns=data_columns,
    261             errors=errors,
--> 262             encoding=encoding,
    263         )
    264 

~/PQKs/pqks/lib/python3.6/site-packages/pandas/io/pytables.py in put(self, key, value, format, index, append, complib, complevel, min_itemsize, nan_rep, data_columns, encoding, errors, track_times)
   1127             encoding=encoding,
   1128             errors=errors,
-> 1129             track_times=track_times,
   1130         )
   1131 

~/PQKs/pqks/lib/python3.6/site-packages/pandas/io/pytables.py in _write_to_group(self, key, value, format, axes, index, append, complib, complevel, fletcher32, min_itemsize, chunksize, expectedrows, dropna, nan_rep, data_columns, encoding, errors, track_times)
   1799             nan_rep=nan_rep,
   1800             data_columns=data_columns,
-> 1801             track_times=track_times,
   1802         )
   1803 

~/PQKs/pqks/lib/python3.6/site-packages/pandas/io/pytables.py in write(self, obj, **kwargs)
   3189             # I have no idea why, but writing values before items fixed #2299
   3190             blk_items = data.items.take(blk.mgr_locs)
-> 3191             self.write_array(f"block{i}_values", blk.values, items=blk_items)
   3192             self.write_index(f"block{i}_items", blk_items)
   3193 

~/PQKs/pqks/lib/python3.6/site-packages/pandas/io/pytables.py in write_array(self, key, value, items)
   3047 
   3048             vlarr = self._handle.create_vlarray(self.group, key, _tables().ObjectAtom())
-> 3049             vlarr.append(value)
   3050 
   3051         elif empty_array:

~/PQKs/pqks/lib/python3.6/site-packages/tables/vlarray.py in append(self, sequence)
    526             nparr = None
    527 
--> 528         self._append(nparr, nobjects)
    529         self.nrows += 1
    530 

~/PQKs/pqks/lib/python3.6/site-packages/tables/hdf5extension.pyx in tables.hdf5extension.VLArray._append()

OverflowError: value too large to convert to int

As noted in the similar issue https://stackoverflow.com/a/57133759/8896855 , hdf/h5 files have more overhead and are intended to optimize many dataframes saved into a single file system. Feather and parquet objects will likely provide a better solution for saving and loading a single large dataframe as an in-memory object. As for the specific overflow error, it is most likely the result of storing large mixed-type (numpy array) columns under the "object" dtype in pandas. One (more complicated) option would be to split the arrays in your dataframe out into separate columns, but that is probably unnecessary; a related sketch that stores the arrays outside pandas entirely is shown below.
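As a minimal sketch of that "move the arrays out of pandas" idea, assuming the arrays themselves are the heavy payload: store every cell's array in one compressed .npz archive (numpy handles complex dtypes natively) and rebuild the object-dtype DataFrame on load. The file name and the 'row|column' key scheme here are illustrative choices, not a fixed API:

import numpy as np
import pandas as pd

def save_df_npz(df, path):
    # one archive entry per cell, keyed 'row|column'
    arrays = {f'{r}|{c}': df.at[r, c] for r in df.index for c in df.columns}
    np.savez_compressed(path, **arrays)

def load_df_npz(path):
    # rebuild the object-dtype DataFrame from the stored cells
    cells = {}
    with np.load(path) as data:
        for key in data.files:
            r, c = key.split('|', 1)
            cells.setdefault(r, {})[c] = data[key]
    return pd.DataFrame.from_dict(cells, orient='index').astype(object)

save_df_npz(df, 'test.npz')
load_test = load_df_npz('test.npz')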

A general quick fix would be to use df.to_pickle(r'path_to/filename.pkl'), but to_feather or to_parquet likely present more optimized solutions.
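For completeness, the pickle round trip on the example above (the file name is just a placeholder); unlike the HDF5 path, pickle serializes object columns holding complex numpy arrays without conversion:

import numpy as np
import pandas as pd

df.to_pickle('test.pkl')                # serialize the whole DataFrame, object cells included
load_test = pd.read_pickle('test.pkl')  # restore it later

# the complex arrays survive the round trip unchanged
assert np.allclose(df.loc['0', 'x0'], load_test.loc['0', 'x0'])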
