
Convert huge CSV to HDF5 format

I downloaded IBM's Airline Reporting Carrier On-Time Performance Dataset; the uncompressed CSV is 84 GB. I want to run an analysis with the vaex library, similar to Flying high with Vaex.

I tried to convert the CSV to an HDF5 file, to make it readable for the vaex library:

import time
import vaex
start=time.time()
df = vaex.from_csv(r"D:\airline.csv", convert=True, chunk_size=1000000)
end=time.time() 
print("Time:",(end-start),"Seconds")

I always get an error when running the code:

RuntimeError: Dirty entry flush destroy failed (file write failed: time = Fri Sep 30 17:58:55 2022
, filename = 'D:\airline.csv_chunk_8.hdf5', file descriptor = 7, errno = 22, error message = 'Invalid argument', buf = 0000021EA8C6B128, total write size = 2040, bytes this sub-write = 2040, bytes actually written = 18446744073709551615, offset = 221133661).

On a second run, I get this error:

RuntimeError: Unable to flush file's cached information (file write failed: time = Fri Sep 30 20:18:19 2022
, filename = 'D:\airline.csv_chunk_18.hdf5', file descriptor = 7, errno = 22, error message = 'Invalid argument', buf = 000002504659B828, total write size = 2048, bytes this sub-write = 2048, bytes actually written = 18446744073709551615, offset = 348515307)

Is there an alternative way to convert the CSV to HDF5 without Python? For example, a downloadable program that can do this job?

I'm not familiar with vaex, so I can't help with usage and functions. However, I can read error messages. :-)

It reports "bytes actually written" as a huge number (18_446_744_073_709_551_615), much larger than the 84 GB CSV; that value is 2^64 - 1, i.e., -1 as an unsigned 64-bit integer, which typically signals a failed write. Some possible explanations:

  1. you ran out of disk space,
  2. you ran out of memory, or
  3. some other error occurred.

To diagnose, try testing with a small CSV file and see if vaex.from_csv() works as expected. I suggest the lax_to_jfk.csv file.
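For example (a sketch, assuming lax_to_jfk.csv, a small file from the same IBM dataset collection, is in the working directory):

import vaex

# Small-file sanity check: if this works, the conversion logic is fine and
# the problem is specific to the 84 GB file (disk space, memory, or dirty data).
df = vaex.from_csv('lax_to_jfk.csv', convert=True)
print(df.head())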

Regarding your question (is there an alternative way to convert a CSV to HDF5?): why not use Python?

Are you more comfortable with other languages? If so, you can install HDF5 and write your code with its C or Fortran API.

OTOH, if you are familiar with Python, there are other packages you can use to read the CSV file and create the HDF5 file.

Python packages to read the CSV
Personally, I like NumPy's genfromtxt() to read the CSV. (You can also use loadtxt(), if you don't have missing values and don't need the field names.) However, I think you will run into memory problems reading an 84 GB file. That said, you can use the skip_header and max_rows parameters with genfromtxt() to read and load a subset of lines. Alternately, you can use csv.DictReader(). It reads a line at a time, so you avoid memory issues, but it could be very slow loading the HDF5 file.
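For illustration, here is a minimal sketch of the skip_header/max_rows idea: pulling a million-row slice out of the middle of the file. The path and offsets are placeholders, and because the header line is skipped, field names are not picked up; they would have to be captured from an earlier names=True read.

import numpy as np

slice_start = 1_000_000   # placeholder: index of the first data row to read
chunk = np.genfromtxt('airline.csv', delimiter=',', dtype=None,
                      encoding='bytes',
                      skip_header=1 + slice_start,  # 1 header line + rows before the slice
                      max_rows=1_000_000)           # load only this many rows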

Python packages to create the HDF5 file
I have used both h5py and PyTables (aka tables) to create and read HDF5 files. Once you load the CSV data to a NumPy array, it's a snap to create the HDF5 dataset.

Here is a very simple example that reads the lax_to_jfk.csv data and loads it into an HDF5 file:

import numpy as np
import h5py

csv_name = 'lax_to_jfk'
rec_arr = np.genfromtxt(csv_name+'.csv', delimiter=',',
                        dtype=None, names=True, encoding='bytes')

with h5py.File(csv_name+'.h5', 'w') as h5f:
    h5f.create_dataset(csv_name, data=rec_arr)
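As a quick sanity check (a sketch reusing the names from the example above), the HDF5 file can be read straight back with h5py:

with h5py.File(csv_name+'.h5', 'r') as h5f:
    arr = h5f[csv_name][:]      # read the whole dataset back into a NumPy array
    print(arr.dtype.names)      # field names recovered from the CSV header
    print(arr.shape)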

Update:
After posting this example, I decided to test with a larger file (airline_2m.csv). It's 861 MB and has 2M rows. I discovered the code above doesn't work. However, it's not because of the number of rows; the problem is the columns (field names). Turns out the data isn't as clean: there are 109 field names on row 1, and some rows have 111 columns of data. As a result, the auto-generated dtype doesn't have a matching field for every column. While investigating this, I also discovered many rows only have values for the first 56 fields. In other words, fields 57-111 are not very useful. One solution is to add the usecols=() parameter. The code below reflects this modification and works with this test file. (I have not tried testing with your large file airline.csv. Given its size, you will likely need to read and load incrementally.)

import numpy as np
import h5py

csv_name = 'airline_2m'
# Read only the first 56 columns; later fields are mostly empty and
# some rows have more columns than the header declares.
rec_arr = np.genfromtxt(csv_name+'.csv', delimiter=',',
                        dtype=None, names=True, encoding='bytes',
                        usecols=range(56))

with h5py.File(csv_name+'.h5', 'w') as h5f:
    h5f.create_dataset(csv_name, data=rec_arr)

I tried reproducing your example. I believe the problem you are facing is quite common when dealing with CSVs: the schema is not known.

Sometimes there are "mixed types", and pandas (used underneath vaex's read_csv or from_csv) casts those columns as dtype object.
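As a toy demonstration of that behaviour (made-up data, not the real file): a column that mixes numbers and text comes back as dtype object.

import pandas as pd
from io import StringIO

csv_text = "a,b\n1,x\n2,3\n"                    # column b mixes text and numbers
print(pd.read_csv(StringIO(csv_text)).dtypes)   # a: int64, b: object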

Vaex does not really support such mixed dtypes, and requires each column to be of a single uniform type (kind of like a database).

So how to get around this? Well, the best way I can think of is to use the dtype argument to explicitly specify the types of all columns (or those that you suspect or know to have mixed types). I know this file has 100+ columns and that's annoying... but that is also kind of the price to pay when using a format such as CSV.
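For illustration, the mapping passed as dtype might look something like this (the column names here are hypothetical, not taken from the actual file):

# Hypothetical column names; the real file has 100+ columns.
dtype = {
    'FlightDate': str,       # keep dates as strings
    'TailNum': str,          # mixed alphanumerics must stay strings
    'DepDelay': 'float64',   # blanks and ints then parse uniformly as floats
}

This dict can then be passed straight through via the dtype argument, as in the loop below.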

Another thing I noticed is the encoding: using pure pandas.read_csv failed at some point because of encoding, and requires one to add encoding="ISO-8859-1". This is also supported by vaex.open (since the args are just passed down to pandas).

In fact, if you want to do manually what vaex.open does automatically for you (given that this CSV file might not be as clean as one would hope), do something like this (this is pseudo-code, but I hope close to the real thing):

import pandas as pd
import vaex

file = r"D:\airline.csv"   # path to the big CSV
# dtype: the column name -> type mapping, as discussed above

# Iterate over the file in chunks
for i, df_tmp in enumerate(pd.read_csv(file, chunksize=11_000_000,
                                       encoding="ISO-8859-1", dtype=dtype)):
    # Assert or check or do whatever needs doing to ensure column types are as they should be

    # Pass the data to vaex (this does not take extra RAM):
    df_vaex = vaex.from_pandas(df_tmp)
    # Export this chunk into HDF5
    df_vaex.export_hdf5(f'chunk_{i}.hdf5')

# When the above loop finishes, concat and export the data to a single file
# if needed (gives some performance benefit).
df = vaex.open('chunk*.hdf5')
df.export_hdf5('converted.hdf5', progress='rich')

I've seen a potentially much better/faster way of doing this with vaex, but it is not released yet (I saw it in the code repo on GitHub), so I will not go into it. But if you can install from source and want me to elaborate further, feel free to drop a comment.

Hope this at least gives some ideas on how to move forward.

EDIT: In the last couple of versions of vaex core, vaex.open() opens all CSV files lazily, so you can just export to hdf5/arrow directly and it will do it in one go. Check the docs for more details: https://vaex.io/docs/guides/io.html#Text-based-file-formats
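Based on that documentation, the direct route would look something like this (a sketch; exact behaviour depends on the installed vaex version):

import vaex

df = vaex.open(r"D:\airline.csv")    # lazy: the CSV is not loaded into RAM
df.export_hdf5(r"D:\airline.hdf5")   # streams the data to HDF5 in one go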
