Wide vs long format when saving data in pandas HDF5

pandas DataFrames are generally represented in long (many rows) or wide (many columns) format.

I'm wondering which format is faster to read and occupies less memory when saved as an HDF file (df.to_hdf).

Is there a general rule, or are there cases where one of the formats should be preferred?

IMO the long format is preferable, as you will have much less metadata overhead (information about column names, dtypes, etc.).
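One way to check the on-disk side of this for your own data is to write both layouts with to_hdf and compare file size and read time. The following is only a minimal sketch under some assumptions: the helper name hdf_size_and_read_time, the key "df", and the temporary file names are made up for illustration, and to_hdf needs the optional PyTables package, so the comparison is skipped if it is not installed. Timings from a single read are noisy and should be treated as a rough indication only.

```python
import os
import tempfile
import time

import numpy as np
import pandas as pd

try:
    import tables  # noqa: F401  (PyTables, the optional backend behind to_hdf)
    HAVE_PYTABLES = True
except ImportError:
    HAVE_PYTABLES = False


def hdf_size_and_read_time(df, path):
    """Write df to an HDF5 file; return (file size in bytes, read time in seconds)."""
    # Hypothetical helper for illustration; "df" is an arbitrary key name.
    df.to_hdf(path, key="df", mode="w")
    start = time.perf_counter()
    restored = pd.read_hdf(path, key="df")
    elapsed = time.perf_counter() - start
    assert restored.shape == df.shape  # sanity check on the round trip
    return os.path.getsize(path), elapsed


# Same number of int32 values in both layouts, as in the session below.
rng = np.random.default_rng(0)
long_df = pd.DataFrame(rng.integers(0, 10**6, (10**4, 4), dtype=np.int32))
wide_df = pd.DataFrame(rng.integers(0, 10**6, (4, 10**4), dtype=np.int32))

results = {}
if HAVE_PYTABLES:
    with tempfile.TemporaryDirectory() as tmp:
        for name, df in [("long", long_df), ("wide", wide_df)]:
            path = os.path.join(tmp, name + ".h5")
            results[name] = hdf_size_and_read_time(df, path)
            size, seconds = results[name]
            print(f"{name}: {size} bytes on disk, read in {seconds:.4f} s")
else:
    print("PyTables is not installed; skipping the HDF5 comparison")
```

The numbers depend on the pandas and PyTables versions and on the storage format (fixed vs table), so it is worth running this on data shaped like your own.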

In terms of memory usage, they are going to be more or less the same:

In [21]: import sys, numpy as np, pandas as pd

In [22]: long = pd.DataFrame(np.random.randint(0, 10**6, (10**4, 4)))

In [23]: wide = pd.DataFrame(np.random.randint(0, 10**6, (4, 10**4)))

In [24]: long.shape
Out[24]: (10000, 4)

In [25]: wide.shape
Out[25]: (4, 10000)

In [26]: sys.getsizeof(long)
Out[26]: 160104

In [27]: sys.getsizeof(wide)
Out[27]: 160104

In [28]: wide.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Columns: 10000 entries, 0 to 9999
dtypes: int32(10000)
memory usage: 156.3 KB

In [29]: long.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 4 columns):
0    10000 non-null int32
1    10000 non-null int32
2    10000 non-null int32
3    10000 non-null int32
dtypes: int32(4)
memory usage: 156.3 KB
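The same comparison can be made without reading it off the info() output. This is a small sketch, not part of the original session: it fixes the dtype to int32 explicitly (an assumption, so that the result does not depend on the platform's default integer size) and sums the per-column data bytes, leaving the index out.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
long_df = pd.DataFrame(rng.integers(0, 10**6, (10**4, 4), dtype=np.int32))
wide_df = pd.DataFrame(rng.integers(0, 10**6, (4, 10**4), dtype=np.int32))

# memory_usage returns one entry per column; index=False excludes the row index.
long_bytes = int(long_df.memory_usage(index=False).sum())
wide_bytes = int(wide_df.memory_usage(index=False).sum())

# Both frames hold 40,000 int32 values, i.e. 40,000 * 4 bytes of data.
print(long_bytes, wide_bytes)  # 160000 160000
```

This only covers the in-memory data buffers; it says nothing about how much space the column labels or other metadata take once the frame is written to an HDF file.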
