Compressing files to .h5
I have a .h5 file from which I read some data, order it in some way, and then save it to another .h5 file. Here is my code:
import h5py
import numpy as np
import pandas as pd

f = pd.read_hdf("input_file.h5")
dt = f.values
# split off the first 2100 columns and the last column
dt2 = np.transpose(np.transpose(dt)[0:2100])
dt3 = np.transpose(np.transpose(dt)[-1])
dt3 = dt3.reshape(1, len(dt3))
d2 = len(dt2[0])
d1 = len(dt2)
# group the columns into triples and sort each row's triples
# by the first element of the triple, in descending order
dt2 = dt2.reshape((d1, d2 // 3, 3))
ordered_index = np.flip(dt2[:, :, 0].argsort(), 1)
dt2 = dt2[np.arange(d1)[:, None], ordered_index].reshape((d1, d2))
dt2 = np.transpose(dt2)
data = np.transpose(np.concatenate((dt2, dt3), axis=0))
df = pd.DataFrame(data=data,
                  index=range(data.shape[0]),
                  columns=[str(i) for i in range(data.shape[1])])
hf = h5py.File('ordered_pt_data.h5', 'w')
hf.create_dataset('dataset_ordered_pt', data=df)
hf.close()
The program runs fine, and when I print the new data (i.e. print(df)) everything looks right: the data is ordered the way I want, and the ordered data has the same dimensions as the input data. However, the input file "input_file.h5" is 2.6 GB while the file I create is 18 GB. What am I doing wrong? Do I need to pass some extra parameter to compress the data more? Again, the output file contains exactly the same data as the input file (same size and type, unless something I did changed the type of the data without me realizing it), just in a different order. Thank you!
You can begin debugging by checking whether the data types are the same:
# ...
print('f dtypes and memory usage')
f.info(memory_usage='deep')
print('df dtypes and memory usage')
df.info(memory_usage='deep')
Check the memory usage:
# ...
print('f memory usage')
print(f.memory_usage(deep=True))
print('df memory')
print(df.memory_usage(deep=True))
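As a hypothetical illustration of why this check matters (the array shape below is made up), the same values stored as float64 take twice the memory of float32 columns, so a dtype change during processing alone can double the on-disk size:

```python
import numpy as np
import pandas as pd

# Hypothetical data: identical values stored as float32 vs. float64
values = np.random.rand(1000, 10).astype(np.float32)
df32 = pd.DataFrame(values)
df64 = df32.astype(np.float64)

mem32 = df32.memory_usage(deep=True).sum()
mem64 = df64.memory_usage(deep=True).sum()

# The float64 column data uses exactly twice the bytes of float32
print(f"float32: {mem32} bytes, float64: {mem64} bytes")
```

If the output DataFrame's dtypes are wider than the input's, casting back with astype before writing shrinks the file accordingly.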
If everything is the same (same data types, same numbers of rows and columns), then the issue is compression.
Per the documentation, you can compress your data as follows:
with h5py.File('ordered_pt_data.h5', 'w') as hf:
    hf.create_dataset('dataset_ordered_pt', data=df, compression="gzip", compression_opts=9)
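To see the effect of that keyword, here is a small self-contained sketch (the file names and array contents are just examples; highly repetitive data like this compresses far better than real data will) comparing the size of the same dataset written with and without gzip:

```python
import os
import h5py
import numpy as np

# Example data standing in for the ordered DataFrame values
data = np.zeros((1000, 1000), dtype=np.float64)

with h5py.File('plain.h5', 'w') as hf:
    hf.create_dataset('d', data=data)

with h5py.File('compressed.h5', 'w') as hf:
    hf.create_dataset('d', data=data, compression='gzip', compression_opts=9)

plain = os.path.getsize('plain.h5')
packed = os.path.getsize('compressed.h5')
print(f"uncompressed: {plain} bytes, gzip-9: {packed} bytes")
```

If you would rather stay in pandas, DataFrame.to_hdf accepts complevel and complib arguments that achieve the same thing (it requires the PyTables package).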