I have a .h5 file from which I read some data. I reorder it and then save it to another .h5 file. Here is my code:
import h5py
import numpy as np
import pandas as pd
f = pd.read_hdf("input_file.h5")
dt = f.values
dt2 = np.transpose(np.transpose(dt)[0:2100])
dt3 = np.transpose(np.transpose(dt)[-1])
dt3 = dt3.reshape(1,len(dt3))
d2 = len(dt2[0])
d1 = len(dt2)
dt2 = dt2.reshape((len(dt2), len(dt2[0])//3, 3))
ordered_index = np.flip(dt2[:,:,0].argsort(),1)
dt2 = dt2[np.arange(len(dt2[:,:,0].argsort()))[:,None],ordered_index].reshape((d1,d2))
dt2 = np.transpose(dt2)
data = np.transpose(np.concatenate((dt2,dt3),axis=0))
df=pd.DataFrame(data=data[0:,0:], index=[i for i in range(data.shape[0])], columns=[str(i) for i in range(data.shape[1])])
hf = h5py.File('ordered_pt_data.h5', 'w')
hf.create_dataset('dataset_ordered_pt', data=df)
hf.close()
The program runs fine, and when I print the new data (i.e. print(df)) everything looks right (i.e. the data is ordered the way I want), and the ordered data has the same dimensions as the input data. However, the input file "input_file.h5" is 2.6 GB, while the file I create is 18 GB. What am I doing wrong? Do I need to pass some extra parameter to compress the data more? Again, the output file contains exactly the same data (both size and type, unless something I did changed the type of the data without me realizing it) as the input file, just in a different order. Thank you!
You can begin debugging by checking whether the data types are the same:
# ...
print('f dtypes and memory usage')
f.info(memory_usage='deep')  # info() prints its report itself and returns None
print('df dtypes and memory usage')
df.info(memory_usage='deep')  # info() prints its report itself and returns None
Check the memory usage:
# ...
print('f memory usage')
print(f.memory_usage(deep=True))
print('df memory')
print(df.memory_usage(deep=True))
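One way the types can change without you noticing is that taking .values of a DataFrame with mixed column dtypes upcasts everything to a single common dtype. The sketch below is only an illustration on toy data: the frame `small` and its column names are made up, and I am assuming (not claiming) that your input mixes float32 and float64 columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: two float32 columns plus one float64 column.
small = pd.DataFrame({
    "0": np.zeros(1000, dtype=np.float32),
    "1": np.zeros(1000, dtype=np.float32),
    "label": np.zeros(1000, dtype=np.float64),
})

# Bytes used when each column keeps its own dtype (index excluded).
per_column = int(small.memory_usage(index=False, deep=True).sum())

# .values returns one array, so mixed dtypes are upcast to a common dtype.
values = small.values
as_array = values.nbytes

print(values.dtype, per_column, as_array)
```

If the dtype printed here is wider than the dtypes of the original columns, everything built from that array (including the new DataFrame) carries the wider type.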
If everything is the same (same data types, same number of rows and columns), then the issue is compression.
Per the documentation, you can compress your data as follows:
with h5py.File('ordered_pt_data.h5', 'w') as hf:
    hf.create_dataset('dataset_ordered_pt', data=df, compression="gzip", compression_opts=9)
See the h5py documentation for more options and details.
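To get a feel for how much compression and dtype each contribute, you could write the same small array a few different ways and compare file sizes. This is a rough sketch on made-up data: the helper `write_dataset`, the temporary file names, and the toy array are all hypothetical, and the ratios you see on real data will differ.

```python
import os
import tempfile

import h5py
import numpy as np

def write_dataset(path, array, **kwargs):
    """Write one dataset to a new file and return the file size in bytes."""
    with h5py.File(path, "w") as hf:
        hf.create_dataset("dataset_ordered_pt", data=array, **kwargs)
    return os.path.getsize(path)

# Hypothetical toy data: mostly zeros, so it compresses very well.
array = np.zeros((2000, 500), dtype=np.float64)
array[:, 0] = np.arange(2000)

with tempfile.TemporaryDirectory() as tmp:
    plain = write_dataset(os.path.join(tmp, "plain.h5"), array)
    gzipped = write_dataset(os.path.join(tmp, "gzip.h5"), array,
                            compression="gzip", compression_opts=4)
    as_f32 = write_dataset(os.path.join(tmp, "f32.h5"),
                           array.astype(np.float32))

print("uncompressed float64:", plain)
print("gzip float64:        ", gzipped)
print("uncompressed float32:", as_f32)
```

Higher compression_opts values trade write time for a smaller file, so on a multi-gigabyte dataset it may be worth trying a moderate level first.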