How to save a large pandas DataFrame with complex arrays and load it up again?
I have a large pandas DataFrame whose individual elements are complex numpy arrays. Please see below a minimal code example to reproduce the scenario:
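import numpy as np
import pandas as pd

# Empty object-dtype frame; each cell is then filled with a 2x2 complex array.
d = {f'x{i}': [] for i in range(4)}
df = pd.DataFrame(data=d).astype(object)
for K in range(4):
    for i in range(4):
        df.loc[f'{K}', f'x{i}'] = np.random.random(size=(2,2)) + np.random.random(size=(2,2)) * 1j
df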
What is the best way to save these and load them up again for use later?
The problem I am having is that when I increase the size of the stored matrices and the number of elements, I get an OverflowError when I try to save the DataFrame as an .h5 file, as shown below:
import numpy as np
import pandas as pd

size = (300,300)
xs = 1500
d = {f'x{i}': [] for i in range(xs)}
df = pd.DataFrame(data=d).astype(object)
# 10 rows x 1500 columns, each cell a 300x300 complex array
for K in range(10):
    for i in range(xs):
        df.loc[f'{K}', f'x{i}'] = np.random.random(size=size) + np.random.random(size=size) * 1j
df.to_hdf('test.h5', key="df", mode="w")
load_test = pd.read_hdf("test.h5", "df")
---------------------------------------------------------------------------
OverflowError Traceback (most recent call last)
<ipython-input-124-8cb8df1a0653> in <module>
12 df.loc[f'{K}', f'x{i}'] = np.random.random(size=size) + np.random.random(size=size) * 1j
13
---> 14 df.to_hdf('test.h5', key="df", mode="w")
15
16
~/PQKs/pqks/lib/python3.6/site-packages/pandas/core/generic.py in to_hdf(self, path_or_buf, key, mode, complevel, complib, append, format, index, min_itemsize, nan_rep, dropna, data_columns, errors, encoding)
2447 data_columns=data_columns,
2448 errors=errors,
-> 2449 encoding=encoding,
2450 )
2451
~/PQKs/pqks/lib/python3.6/site-packages/pandas/io/pytables.py in to_hdf(path_or_buf, key, value, mode, complevel, complib, append, format, index, min_itemsize, nan_rep, dropna, data_columns, errors, encoding)
268 path_or_buf, mode=mode, complevel=complevel, complib=complib
269 ) as store:
--> 270 f(store)
271 else:
272 f(path_or_buf)
~/PQKs/pqks/lib/python3.6/site-packages/pandas/io/pytables.py in <lambda>(store)
260 data_columns=data_columns,
261 errors=errors,
--> 262 encoding=encoding,
263 )
264
~/PQKs/pqks/lib/python3.6/site-packages/pandas/io/pytables.py in put(self, key, value, format, index, append, complib, complevel, min_itemsize, nan_rep, data_columns, encoding, errors, track_times)
1127 encoding=encoding,
1128 errors=errors,
-> 1129 track_times=track_times,
1130 )
1131
~/PQKs/pqks/lib/python3.6/site-packages/pandas/io/pytables.py in _write_to_group(self, key, value, format, axes, index, append, complib, complevel, fletcher32, min_itemsize, chunksize, expectedrows, dropna, nan_rep, data_columns, encoding, errors, track_times)
1799 nan_rep=nan_rep,
1800 data_columns=data_columns,
-> 1801 track_times=track_times,
1802 )
1803
~/PQKs/pqks/lib/python3.6/site-packages/pandas/io/pytables.py in write(self, obj, **kwargs)
3189 # I have no idea why, but writing values before items fixed #2299
3190 blk_items = data.items.take(blk.mgr_locs)
-> 3191 self.write_array(f"block{i}_values", blk.values, items=blk_items)
3192 self.write_index(f"block{i}_items", blk_items)
3193
~/PQKs/pqks/lib/python3.6/site-packages/pandas/io/pytables.py in write_array(self, key, value, items)
3047
3048 vlarr = self._handle.create_vlarray(self.group, key, _tables().ObjectAtom())
-> 3049 vlarr.append(value)
3050
3051 elif empty_array:
~/PQKs/pqks/lib/python3.6/site-packages/tables/vlarray.py in append(self, sequence)
526 nparr = None
527
--> 528 self._append(nparr, nobjects)
529 self.nrows += 1
530
~/PQKs/pqks/lib/python3.6/site-packages/tables/hdf5extension.pyx in tables.hdf5extension.VLArray._append()
OverflowError: value too large to convert to int
As noted in the similar issue https://stackoverflow.com/a/57133759/8896855 , HDF5 (.h5) files carry more overhead and are intended for keeping many DataFrames together in a single file. Feather and Parquet will likely be a better fit for saving and loading one larger DataFrame as an in-memory object. As for the specific OverflowError, it is likely the result of storing large mixed-type values (here, numpy arrays) in pandas columns of the "object" dtype. One (more complicated) option would be to split the arrays in your DataFrame out into separate columns, but that's probably unnecessary.
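A related way to get away from the "object" dtype, which is not the column-splitting mentioned above but a swapped-in alternative, is to stack the cells into one 4-D complex array and save that with numpy alongside the row and column labels. The sketch below assumes every cell has the same shape and dtype; the helper names `make_frame`, `frame_to_npz` and `npz_to_frame` are made up for illustration, and the frame is kept tiny so it runs quickly.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def make_frame(n_rows=3, n_cols=4, shape=(2, 2), seed=0):
    """Build a small object-dtype DataFrame whose cells are complex arrays."""
    rng = np.random.default_rng(seed)
    cells = np.empty((n_rows, n_cols), dtype=object)
    for r in range(n_rows):
        for c in range(n_cols):
            cells[r, c] = rng.random(shape) + 1j * rng.random(shape)
    return pd.DataFrame(
        cells,
        index=[str(k) for k in range(n_rows)],
        columns=[f"x{i}" for i in range(n_cols)],
    )


def frame_to_npz(df, path):
    """Stack same-shaped cells into one 4-D array and save it with the labels."""
    # Resulting shape: (n_rows, n_cols, height, width), dtype complex128.
    stacked = np.stack([np.stack(list(row)) for row in df.to_numpy()])
    np.savez(
        path,
        values=stacked,
        index=np.asarray(df.index, dtype=str),
        columns=np.asarray(df.columns, dtype=str),
    )


def npz_to_frame(path):
    """Rebuild the object-dtype DataFrame from the stacked array and labels."""
    with np.load(path) as npz:
        values, index, columns = npz["values"], npz["index"], npz["columns"]
    cells = np.empty(values.shape[:2], dtype=object)
    for r in range(values.shape[0]):
        for c in range(values.shape[1]):
            cells[r, c] = values[r, c]
    return pd.DataFrame(cells, index=index, columns=columns)


df = make_frame()
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "frame.npz")
    frame_to_npz(df, path)
    restored = npz_to_frame(path)

print(restored.shape)
```

If the cells do not all share one shape, this stacking does not apply and a per-cell format such as pickle is the simpler route.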
A general quick fix would be to use df.to_pickle(r'path_to/filename.pkl'), but to_feather or to_parquet likely present more optimized solutions.
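As a minimal sketch of the pickle route, the round trip below writes a small object-dtype frame with `to_pickle` and reads it back with `pd.read_pickle`. The frame size, file name and temporary directory are assumptions chosen for illustration; the full-size frame from the question is not exercised here.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Small stand-in for the real frame: every cell is a 2x2 complex array.
rng = np.random.default_rng(0)
cells = np.empty((2, 3), dtype=object)
for r in range(2):
    for c in range(3):
        cells[r, c] = rng.random((2, 2)) + 1j * rng.random((2, 2))
df = pd.DataFrame(cells, index=["0", "1"], columns=["x0", "x1", "x2"])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "frame.pkl")
    df.to_pickle(path)            # serialize the whole frame, arrays included
    loaded = pd.read_pickle(path)  # load it back as a DataFrame

# Compare cell by cell, since each cell is itself an array.
same = all(
    np.array_equal(loaded.iloc[r, c], df.iloc[r, c])
    for r in range(df.shape[0])
    for c in range(df.shape[1])
)
print(same)
```

Pickle files can execute arbitrary code when loaded, so only read ones you created yourself or otherwise trust.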