What did the HDF5 format do to the CSV file?
I had a 33 GB CSV file, but after converting it to HDF5 the file size dropped drastically, to around 1.4 GB. I used the vaex library to read my dataset and then converted the vaex dataframe to a pandas dataframe. This conversion did not put much load on my RAM.
I wanted to ask what this process (CSV --> HDF5 --> pandas dataframe) did so that the pandas dataframe now takes up far less memory than when I read it directly from the CSV file (CSV --> pandas dataframe)?
I highly doubt it has anything to do with compression. In fact, I would assume the file should be larger in HDF5 format, especially in the presence of numeric features.
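Whether numeric columns come out larger or smaller in a binary format depends on how the numbers were written in the CSV. As a rough sketch, the snippet below compares the bytes needed to store floats as delimited text against fixed 8-byte float64 values. The function name `text_vs_binary_bytes` is made up for illustration, and the comparison ignores HDF5's own metadata and any chunking or compression settings.

```python
import struct


def text_vs_binary_bytes(values):
    """Return (text_bytes, binary_bytes) for a list of floats.

    text_bytes:   size if each value is written with repr() plus one
                  delimiter character (comma or newline), as in a CSV.
    binary_bytes: size if each value is stored as a raw 8-byte float64.
    """
    text_bytes = sum(len(repr(float(v))) + 1 for v in values)
    binary_bytes = sum(len(struct.pack("<d", float(v))) for v in values)
    return text_bytes, binary_bytes


if __name__ == "__main__":
    # Short values are cheaper as text, long decimals are cheaper as binary.
    print(text_vs_binary_bytes([1.0, 2.5]))
    print(text_vs_binary_bytes([1 / 3, 2 / 3]))
```

So a CSV full of long decimal values can shrink when converted, while one full of short integers can grow. On its own this would not normally explain a drop from 33 GB to 1.4 GB.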
How did you convert from CSV to HDF5? Are the number of columns and rows the same?
Assuming you converted it somehow with vaex, please check that you are not looking at a single "chunk" of data. Vaex does things in steps and then concatenates the results into a single file.
Also, if some columns are of an unsupported type, they might not be exported.
Doing some sanity checks will uncover more hints.
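One such sanity check, sketched here with only the Python standard library, is to count the data rows and header columns of the original CSV in a streaming pass and compare them with what the converted file reports. The helper name `csv_shape` is hypothetical, and the sketch assumes the CSV has a single header row and no quoted fields that contain newlines split in unusual ways.

```python
import csv


def csv_shape(path, encoding="utf-8"):
    """Return (n_data_rows, column_names) for a CSV file with one header row.

    The file is streamed row by row, so memory use stays small even for
    very large files. Completely blank lines are not counted as rows.
    """
    with open(path, newline="", encoding=encoding) as f:
        reader = csv.reader(f)
        header = next(reader, [])
        n_rows = sum(1 for row in reader if row)
    return n_rows, header
```

The result can then be compared with the row count and column names of the dataframe opened from the HDF5 file (for a vaex dataframe, presumably `len(df)` and `df.get_column_names()`). A much smaller row count would suggest that only one chunk was exported, and missing names would point to columns that were dropped.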