Wide vs long format when saving data in pandas HDF5

Question

pandas DataFrames are generally represented in either long format (many rows) or wide format (many columns).

I'm wondering which format is faster to read and takes up less space when saved as an HDF file (df.to_hdf).

Is there a general rule, or are there cases where one of the formats should be preferred?

Answer 1

IMO the long format is preferable, because it carries much less metadata overhead (column names, dtypes, etc.).
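One way to check this on your own data is to write both layouts to disk and compare file size and read time. The following is only a rough sketch: the helper name hdf_size_and_read_time, the key "df" and the file names are made up for illustration, it assumes the optional PyTables package is installed, and a single timing is noisy, so treat the numbers as indicative only.

```python
import os
import tempfile
import time

import numpy as np
import pandas as pd


def hdf_size_and_read_time(df, path, key="df"):
    """Write df to HDF5, then return (file size in bytes, seconds to read it back)."""
    # Default "fixed" format; to_hdf needs the optional PyTables package.
    df.to_hdf(path, key=key, mode="w")
    size = os.path.getsize(path)
    start = time.perf_counter()
    pd.read_hdf(path, key=key)
    return size, time.perf_counter() - start


if __name__ == "__main__":
    # Same values in both layouts, so only the shape differs.
    values = np.random.randint(0, 10**6, (10**4, 4))
    frames = {"long": pd.DataFrame(values), "wide": pd.DataFrame(values.T)}
    with tempfile.TemporaryDirectory() as tmp:
        for name, frame in frames.items():
            try:
                size, seconds = hdf_size_and_read_time(
                    frame, os.path.join(tmp, name + ".h5")
                )
            except ImportError:
                print("PyTables is not installed; skipping the HDF5 comparison")
                break
            print(f"{name}: {size} bytes on disk, read back in {seconds:.4f} s")
```

For a more trustworthy timing, repeat the read several times (for example with %timeit in IPython) rather than relying on one measurement.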

In terms of in-memory usage, the two are more or less the same:

In [19]: import sys

In [20]: import numpy as np

In [21]: import pandas as pd

In [22]: long = pd.DataFrame(np.random.randint(0, 10**6, (10**4, 4)))

In [23]: wide = pd.DataFrame(np.random.randint(0, 10**6, (4, 10**4)))

In [24]: long.shape
Out[24]: (10000, 4)

In [25]: wide.shape
Out[25]: (4, 10000)

In [26]: sys.getsizeof(long)
Out[26]: 160104

In [27]: sys.getsizeof(wide)
Out[27]: 160104

In [28]: wide.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Columns: 10000 entries, 0 to 9999
dtypes: int32(10000)
memory usage: 156.3 KB

In [29]: long.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 4 columns):
0    10000 non-null int32
1    10000 non-null int32
2    10000 non-null int32
3    10000 non-null int32
dtypes: int32(4)
memory usage: 156.3 KB
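The identical figures follow from simple arithmetic: both frames hold 40,000 int32 values at 4 bytes each, i.e. 160,000 bytes (about 156.3 KB), regardless of shape. The small remainder reported by sys.getsizeof presumably comes from the index and object overhead. A minimal sketch of that check, assuming int32 data as in the session above (the dtype is forced here because the default integer type depends on the platform):

```python
import numpy as np
import pandas as pd

# Force int32 so the result matches the session above on any platform.
values = np.random.randint(0, 10**6, (10**4, 4), dtype=np.int32)
long = pd.DataFrame(values)
wide = pd.DataFrame(values.T)

# Bytes needed for the raw values: number of elements times bytes per element.
expected = values.size * values.itemsize

# Per-column byte counts, excluding the index, summed over all columns.
long_bytes = long.memory_usage(index=False).sum()
wide_bytes = wide.memory_usage(index=False).sum()

print(expected, long_bytes, wide_bytes)
```

So in memory the layout does not matter for the data itself; any difference comes from how the file format stores the per-column metadata.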

solution1
0 2016-11-11 10:36:10