Compressing a file to .h5

Question

I have a .h5 file from which I read some data, reorder it, and then save it to another .h5 file. Here is my code:

import h5py
import numpy as np
import pandas as pd

f = pd.read_hdf("input_file.h5")

dt = f.values

dt2 = np.transpose(np.transpose(dt)[0:2100])
dt3 = np.transpose(np.transpose(dt)[-1])
dt3 = dt3.reshape(1,len(dt3))

d2 = len(dt2[0])
d1 = len(dt2)

dt2 = dt2.reshape((len(dt2), len(dt2[0])//3, 3))
ordered_index = np.flip(dt2[:,:,0].argsort(),1)

dt2 = dt2[np.arange(len(dt2[:,:,0].argsort()))[:,None],ordered_index].reshape((d1,d2))
dt2 = np.transpose(dt2)

data = np.transpose(np.concatenate((dt2,dt3),axis=0))

df=pd.DataFrame(data=data[0:,0:], index=[i for i in range(data.shape[0])], columns=[str(i) for i in range(data.shape[1])])


hf = h5py.File('ordered_pt_data.h5', 'w')
hf.create_dataset('dataset_ordered_pt', data=df)
hf.close()

The program runs fine, and when I print the new data (i.e. print(df)) everything looks right (i.e. the data is ordered the way I want) and the ordered data has the same dimensions as the input data. However, the input file "input_file.h5" is 2.6 GB while the file I create is 18 GB. What am I doing wrong? Do I need to pass some extra parameter to compress the data more? Again, the output file contains exactly the same data as the input file (both size and type, unless something I did changed the type of the data without my realizing it), just in a different order. Thank you!

Answer 1

You can begin debugging by checking whether the data types are the same:


# ...

print('f dtypes and memory usage')
f.info(memory_usage='deep')  # info() prints its report itself and returns None


print('df dtypes and memory usage')
df.info(memory_usage='deep')

Check the memory usage:

# ...
print('f memory usage')
print(f.memory_usage(deep=True))

print('df memory')
print(df.memory_usage(deep=True))

If everything is the same (same data types, same number of rows and columns), then the issue is compression.
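One way the data types can differ without anyone noticing is dtype promotion: calling .values on a DataFrame with mixed column types returns a single array in one common dtype. The sketch below uses a small made-up frame (float32 features plus an int64 label column, which is an assumption and not something the question confirms) to show how a round trip through .values can increase the memory footprint.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the input: float32 features plus an int64 label.
small = pd.DataFrame({
    "a": np.zeros(4, dtype=np.float32),
    "b": np.zeros(4, dtype=np.float32),
    "label": np.zeros(4, dtype=np.int64),
})

# .values returns one array, so mixed column dtypes are promoted
# to a single common dtype (float64 for float32 + int64).
values = small.values
rebuilt = pd.DataFrame(values)

original_bytes = small.memory_usage(index=False).sum()
rebuilt_bytes = rebuilt.memory_usage(index=False).sum()

print("original dtypes:", sorted(str(t) for t in small.dtypes.unique()))
print("array dtype after .values:", values.dtype)
print("bytes before/after:", original_bytes, rebuilt_bytes)
```

If the real input shows a similar change, casting back with astype before writing would be one option, but promotion alone would not explain a 7x size difference, so compression is still worth checking.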

Per the documentation, you can compress your data as follows:

with h5py.File('ordered_pt_data.h5', 'w') as hf:
    hf.create_dataset('dataset_ordered_pt', data=df, compression="gzip", compression_opts=9)

See the h5py documentation for more options and details.
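To see what the compression option does to file size, here is a minimal sketch that writes the same array twice, once without and once with gzip, and compares the sizes on disk. The array is synthetic and highly repetitive, and the file and dataset names are made up for illustration, so the ratio it prints says nothing about how well the real data will compress.

```python
import os
import tempfile

import h5py
import numpy as np

# Synthetic, highly repetitive data; real measurements compress less well.
data = np.zeros((1000, 100), dtype=np.float64)

tmpdir = tempfile.mkdtemp()
plain_path = os.path.join(tmpdir, "plain.h5")
gzip_path = os.path.join(tmpdir, "gzip.h5")

# Default: contiguous storage, no compression.
with h5py.File(plain_path, "w") as hf:
    hf.create_dataset("d", data=data)

# gzip level 9: smallest output, slowest to write.
with h5py.File(gzip_path, "w") as hf:
    hf.create_dataset("d", data=data, compression="gzip", compression_opts=9)

plain_size = os.path.getsize(plain_path)
gzip_size = os.path.getsize(gzip_path)
print("uncompressed bytes:", plain_size)
print("gzip bytes:", gzip_size)

# Compression is lossless and transparent on read.
with h5py.File(gzip_path, "r") as hf:
    restored = hf["d"][...]
print("round trip identical:", np.array_equal(restored, data))
```

Lower compression_opts values trade some file size for faster writes, which may matter for a multi-gigabyte dataset.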

Solution 1
1 2019-10-31 06:14:26
