Wide vs. long format when saving data with pandas HDF5

Question

pandas DataFrames are generally represented in long (many rows) or wide (many columns) format.

I'm wondering which format is faster to read and occupies less memory when saved as an HDF file (df.to_hdf).

Is there a general rule, or are there cases where one of the formats should be preferred?

Answer 1

IMO the long format is much preferable, as you will have much less metadata overhead (information about column names, dtypes, etc.).
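One way to check this on your own data is to write each layout to a temporary HDF5 file and compare file size and read time. The sketch below is only an illustration: the helper name `hdf_footprint`, the key "df", the file name, and the `long_df`/`wide_df` variables are made up for this example, and `to_hdf` needs the optional PyTables package. It uses the default fixed format; results will vary with format, compression, and dtypes.

```python
import os
import tempfile
import time

import numpy as np
import pandas as pd


def hdf_footprint(df, key="df"):
    """Write df to a temporary HDF5 file; return (size_in_bytes, read_seconds).

    Hypothetical helper for illustration; requires PyTables to be installed.
    """
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "frame.h5")
        # mode="w" creates a fresh file so earlier runs cannot inflate the size.
        df.to_hdf(path, key=key, mode="w")
        size = os.path.getsize(path)
        # Time a single read of the stored frame.
        start = time.perf_counter()
        restored = pd.read_hdf(path, key=key)
        elapsed = time.perf_counter() - start
    # Sanity check: the round trip should preserve the shape.
    assert restored.shape == df.shape
    return size, elapsed


# Same number of values in both layouts, only the orientation differs.
rng = np.random.default_rng(0)
long_df = pd.DataFrame(rng.integers(0, 10**6, size=(10**4, 4)))
wide_df = pd.DataFrame(rng.integers(0, 10**6, size=(4, 10**4)))
```

Calling `hdf_footprint(long_df)` and `hdf_footprint(wide_df)` returns a (bytes, seconds) pair for each layout. A single timed read is noisy, so repeat it a few times before drawing conclusions.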

In terms of memory usage, they are going to be more or less the same:

# Setup assumed by this session: import sys, numpy as np, pandas as pd
In [22]: long = pd.DataFrame(np.random.randint(0, 10**6, (10**4, 4)))

In [23]: wide = pd.DataFrame(np.random.randint(0, 10**6, (4, 10**4)))

In [24]: long.shape
Out[24]: (10000, 4)

In [25]: wide.shape
Out[25]: (4, 10000)

In [26]: sys.getsizeof(long)
Out[26]: 160104

In [27]: sys.getsizeof(wide)
Out[27]: 160104

In [28]: wide.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Columns: 10000 entries, 0 to 9999
dtypes: int32(10000)
memory usage: 156.3 KB

In [29]: long.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 4 columns):
0    10000 non-null int32
1    10000 non-null int32
2    10000 non-null int32
3    10000 non-null int32
dtypes: int32(4)
memory usage: 156.3 KB
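
The identical numbers follow from simple arithmetic: both frames hold 4 x 10,000 int32 values at 4 bytes each, i.e. 160,000 bytes of data, regardless of orientation. A minimal sketch to confirm this, assuming NumPy's `default_rng` generator and the variable names `long_df`/`wide_df` (neither is from the original session), with the dtype forced to int32 to match the output above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Force int32 so the result does not depend on the platform's default integer.
long_df = pd.DataFrame(rng.integers(0, 10**6, size=(10**4, 4), dtype=np.int32))
wide_df = pd.DataFrame(rng.integers(0, 10**6, size=(4, 10**4), dtype=np.int32))

# index=False leaves out the row index and counts only the column data.
long_bytes = int(long_df.memory_usage(index=False).sum())
wide_bytes = int(wide_df.memory_usage(index=False).sum())

# 40,000 values * 4 bytes each, in either orientation.
expected = 4 * 10**4 * np.dtype(np.int32).itemsize

print(long_bytes, wide_bytes, expected)
```

This only measures the in-memory data; it says nothing about the per-column metadata that gets written to disk.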

Answer 1
0 2016-11-11 10:36:10
