
Appending rows with pandas' to_hdf multiplies H5 file size?

I have an HDF5 file with about 13,000 rows × 5 columns. These rows were appended over time to the same file with DF.to_hdf(Filename, 'df', append=True, format='table'), and here's the size:

-rw-r--r--  1 omnom  omnom   807M Mar 10 15:55 Final_all_result.h5

Recently I received a ValueError because the data I was trying to append to one of the columns was longer than the declared column size (2000, set with min_itemsize).
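For readers hitting the same error, here is a minimal sketch of that failure mode (file and column names are hypothetical, not from the original setup): the first table write fixes the string column width via min_itemsize, and appending a longer value raises the ValueError.

import pandas as pd

df = pd.DataFrame({'Content': ['short text']})
# First write declares 'Content' as a fixed-width 2000-character string column.
df.to_hdf('demo.h5', 'df', format='table', append=True,
          data_columns=['Content'], min_itemsize={'Content': 2000})

# Appending a value longer than the declared width raises ValueError.
too_long = pd.DataFrame({'Content': ['x' * 3000]})
too_long.to_hdf('demo.h5', 'df', format='table', append=True)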

So I loaded all rows into memory and dumped them into a new HDF file in one go with:

DF.to_hdf(newFilename,
          'df',
          mode='a',
          data_columns=['Code', 'ID', 'Category', 'Title', 'Content'],
          format='table',
          min_itemsize={'index': 24,
                        'Code': 8,
                        'ID': 32,
                        'Category': 24,
                        'Title': 192,
                        'Content': 5000})

I was really surprised that the new file size is about 1/10 of the original file:

-rw-r--r--  1 omnom  omnom    70M Mar 10 16:01 Final_all_result_5000.h5

I double-checked the number of rows in both files; they're equal.
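One quick way to run that check, and to see how each table was declared, is pandas' HDFStore (a sketch reusing the two file names from above):

import pandas as pd

for path in ('Final_all_result.h5', 'Final_all_result_5000.h5'):
    with pd.HDFStore(path, mode='r') as store:
        # info() summarizes each stored table, including its row count.
        print(path)
        print(store.info())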

Do I append new rows the wrong way, causing the file size to multiply with every append operation? I Googled and searched here, but I don't think this was discussed before, or maybe I searched with the wrong keywords.

Any help is appreciated.

UPDATE: I tried adding min_itemsize for all data columns in the append call, per the suggestion in this thread: pandas pytables append: performance and increase in file size:

DF.to_hdf(h5AbsPath,
          'df',
          mode='a',
          data_columns=['Code', 'ID', 'Category', 'Title', 'Content'],
          min_itemsize={'index': 24,
                        'Code': 8,
                        'ID': 32,
                        'Category': 24,
                        'Title': 192,
                        'Content': 5000},
          append=True)

but it still doesn't reduce the file size.

Thanks for the suggestions to add compression; both the appended and the newly dumped files are uncompressed, per requirement.

I used to save .h5 files from pandas DataFrames. Try adding complib='blosc' and complevel=9 to the to_hdf() call. This should reduce the file size.
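A minimal sketch of that suggestion, reusing the call from the question (the output file name is a placeholder):

DF.to_hdf('Final_all_result_compressed.h5',
          'df',
          format='table',
          data_columns=['Code', 'ID', 'Category', 'Title', 'Content'],
          complib='blosc',  # Blosc compressor, supported by PyTables
          complevel=9)      # compression level, 0 (none) to 9 (maximum)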
