
How can I optimize file I/O in Python when I process GB-sized files via NFS?

Original post: 2020-01-29 03:07:35 | Tags: python, pandas, bigdata, nfs

I'm manipulating several files over NFS because of security concerns. Slow file I/O makes processing anything very painful. The issue is described below.

I use pandas in Python to do simple processing on data, so I use read_csv() and to_csv() frequently.
Currently, writing a 10 GB CSV file takes nearly 30 minutes, whereas reading it takes 2 minutes.
I have enough CPU cores (> 20) and memory (50 GB to 100 GB).
It is hard to ask for more bandwidth.
I frequently need to access the data in a column-oriented manner. For example, there would be 100M records with 20 columns (most of them numeric), and I frequently read all 100M records for the values of only 3 to 4 columns.
I've tried HDF5, but it produces a larger file and takes a similar time to write. It also does not provide column-oriented I/O, so I've discarded this option.
I cannot store the files locally; it would violate many security criteria. I'm actually working on a virtual machine, and the file system is mounted via NFS.
I read some columns repeatedly and others not. The task is something like data analysis.

Which approaches can I consider? In several cases, I use sqlite3 to manipulate data in a simple way and export the results to CSV files. Can I accelerate I/O tasks by using sqlite3 in Python? If it provides column-wise operations, I reckon it would be a good solution.
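To make the sqlite3 idea concrete, here is a minimal sketch of selecting only a few columns with Python's built-in sqlite3 module. The table name `records` and the columns `a`, `b`, `c` are hypothetical, and an in-memory database stands in for a file on the NFS mount. Note that SQLite stores data row by row, so by default a column selection trims what is returned to Python rather than what is read from disk; whether this helps over NFS would need to be measured.

```python
import sqlite3

# Hypothetical table and column names, used only for illustration.
conn = sqlite3.connect(":memory:")  # a real run would point at a file path
conn.execute("CREATE TABLE records (id INTEGER, a REAL, b REAL, c TEXT)")

rows = [(i, i * 0.5, i * 2.0, f"label{i}") for i in range(5)]
with conn:  # commits the batch insert as a single transaction
    conn.executemany("INSERT INTO records VALUES (?, ?, ?, ?)", rows)

# Ask only for the columns that are needed. SQLite is row-oriented, so the
# full rows are still read from storage; only the result set is narrower.
selected = conn.execute("SELECT a, b FROM records ORDER BY id").fetchall()
conn.close()

print(selected)
```

Wrapping the inserts in one transaction, as above, is usually much faster than committing row by row, which matters more when the database file lives on a slow mount.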

1 Answer

Two options: pandas HDF5 or dask.

You can review the HDF5 format with format='table'.

HDFStore supports another PyTables format on disk, the table format. Conceptually a table is shaped very much like a DataFrame, with rows and columns. A table may be appended to in the same or other sessions. In addition, delete and query type operations are supported. This format is specified by format='table' or format='t' to append or put or to_hdf.
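As a rough illustration of the table format described above, the sketch below writes a small DataFrame with format='table' and reads back a subset of columns with a row filter. The file name, key, column names, and data are all hypothetical, and the code assumes the PyTables package (`tables`) is installed alongside pandas. It shows the API only; it does not demonstrate that write time or file size improves over NFS.

```python
import os
import tempfile

import numpy as np
import pandas as pd  # to_hdf/read_hdf also require the PyTables package

# Hypothetical data: column names and sizes are for illustration only.
df = pd.DataFrame({
    "a": np.arange(1000, dtype="float64"),
    "b": np.arange(1000, dtype="float64") * 2,
    "c": np.arange(1000, dtype="int64"),
})

path = os.path.join(tempfile.mkdtemp(), "records.h5")

# format="table" writes a PyTables table that can be appended to and queried.
# data_columns lists the columns that may be used in a `where` filter.
df.to_hdf(path, key="records", format="table", data_columns=["c"])

# Load only two columns, and only the rows that match the filter on "c".
subset = pd.read_hdf(path, key="records", columns=["a", "b"], where="c < 10")

print(subset.shape)
print(list(subset.columns))
```

Only columns named in data_columns can be filtered on with `where`, so it is worth choosing them based on the queries that are actually run.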