
What did the HDF5 format do to the csv file?

I had a 33GB csv file, but after converting it to HDF5 format the file size dropped drastically to around 1.4GB. I used the vaex library to read my dataset and then converted the vaex dataframe to a pandas dataframe. This conversion did not put too much load on my RAM.

What did this process (CSV --> HDF5 --> pandas dataframe) do so that the pandas dataframe now takes up far less memory than when I read it directly from the CSV file (CSV --> pandas dataframe)?

  1. HDF5 can compress the data, which would explain the lower disk usage.
  2. In terms of RAM, I wouldn't expect any difference at all; if anything, the HDF5 route might consume slightly more memory because of the processing the format requires.
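To check the RAM point rather than guess, one option is to measure what a pandas dataframe actually occupies after `read_csv`. The sketch below is only an illustration: the three-column sample is made up and stands in for a slice of the real file, and the numbers it prints say nothing about the 33GB dataset itself.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical sample standing in for a slice of the real CSV file.
rng = np.random.default_rng(0)
sample = pd.DataFrame({
    "id": np.arange(1000),
    "value": rng.random(1000),
    "label": rng.choice(["a", "b", "c"], size=1000),
})
csv_text = sample.to_csv(index=False)

# Read the CSV text back the same way a file on disk would be read.
df = pd.read_csv(io.StringIO(csv_text))

# deep=True also counts the string data behind text columns,
# which the default shallow measurement leaves out.
per_column = df.memory_usage(deep=True)
total_bytes = int(per_column.sum())
csv_bytes = len(csv_text.encode("utf-8"))

print(df.dtypes)
print(per_column)
print(f"CSV text: {csv_bytes} bytes, in memory: {total_bytes} bytes")
```

Running the same measurement on the dataframe obtained through each route (directly from CSV, and via HDF5) would show whether the two really differ in memory and which columns account for it.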

I highly doubt it has anything to do with compression. In fact, I would expect the file to be larger in HDF5 format, especially when numeric features are present.
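Whether numeric data is larger as text or as fixed-width binary depends on the values, so it may be worth testing on a sample. Here is a rough sketch with made-up data; `text_bytes` is a hypothetical helper that approximates the size of one CSV column, and the binary size is simply the uncompressed NumPy buffer, not a real HDF5 file.

```python
import numpy as np

rng = np.random.default_rng(0)
floats = rng.random(10_000)                                  # float64, many digits as text
small_ints = rng.integers(0, 10, 10_000, dtype=np.int64)     # single-digit integers

def text_bytes(values):
    """Approximate CSV size of one column: each value as text plus a newline."""
    return sum(len(str(v)) + 1 for v in values.tolist())

for name, arr in [("floats", floats), ("small_ints", small_ints)]:
    print(f"{name}: text={text_bytes(arr)} bytes, binary={arr.nbytes} bytes")
```

Full-precision floats take more room as text than as 8-byte binary values, while single-digit integers take less. Neither effect on its own says what happened to the 33GB file, but it shows that the answer depends on the column types involved.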

How did you convert from csv to hdf5? Are the numbers of columns and rows the same?

Assuming you converted it with vaex somehow, please check that you are not looking at a single "chunk" of the data. Vaex does the conversion in steps and then concatenates the results into a single file.

Also, if some columns are of an unsupported type, they might not be exported.

Doing some sanity checks will uncover more hints.
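A minimal sketch of such a sanity check, assuming the converted data is available as a pandas dataframe: it counts the CSV rows in chunks (so the whole file never has to fit in RAM) and reports any columns that went missing. The helper names `csv_shape` and `compare` and the tiny sample file are hypothetical; with a vaex dataframe, `len(df)` and `df.get_column_names()` should supply the same information.

```python
import os
import tempfile

import numpy as np
import pandas as pd

def csv_shape(path, chunksize=100_000):
    """Count rows and collect column names without loading the whole CSV."""
    rows = 0
    columns = []
    with pd.read_csv(path, chunksize=chunksize) as reader:
        for chunk in reader:
            rows += len(chunk)
            if not columns:
                columns = list(chunk.columns)
    return rows, columns

def compare(csv_path, converted, chunksize=100_000):
    """Report row counts and columns present in the CSV but not in `converted`."""
    rows, columns = csv_shape(csv_path, chunksize=chunksize)
    return {
        "csv_rows": rows,
        "converted_rows": len(converted),
        "missing_columns": [c for c in columns if c not in converted.columns],
    }

with tempfile.TemporaryDirectory() as tmp:
    # Tiny stand-in for the real CSV file.
    csv_path = os.path.join(tmp, "sample.csv")
    pd.DataFrame({
        "a": np.arange(250),
        "b": np.arange(250) * 0.5,
        "c": ["x"] * 250,
    }).to_csv(csv_path, index=False)

    # Simulate a faulty conversion: only one chunk kept, one column dropped.
    converted = pd.read_csv(csv_path).head(100).drop(columns=["c"])
    report = compare(csv_path, converted, chunksize=100)

print(report)
```

If the row counts differ or `missing_columns` is not empty, the smaller file is at least partly explained by data that never made it into the HDF5 file.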
