Most efficient file type to read in only specific rows, in python (very large files)

I am currently using parquet files due to their outstanding read-in time. However, now I am looking to change the functionality of my program slightly. The files will become too large for memory, and instead I wish to read in only specific rows of the files.

The files have around 15 GB of data each (and I will be using multiple files), with several hundred columns and millions of rows. If I wanted to read in, for example, only row x, operate on it, and then read in a new row (millions of times over), what would be the most efficient file type for doing this?

I am not too concerned about compression, as RAM is my limiting factor rather than storage.

Thanks in advance for your expertise!
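For reference, parquet already supports reading less than a whole file: with pyarrow you can read individual row groups (and individual columns) instead of materialising the full table. A minimal sketch, assuming the file was written with reasonably small row groups; the file and column names are illustrative:

    import pyarrow.parquet as pq

    pf = pq.ParquetFile("data.parquet")
    print(pf.num_row_groups)  # parquet files are split into row groups

    # Read only the third row group, and only two columns, into memory
    # instead of loading the whole 15 GB file.
    table = pf.read_row_group(2, columns=["col_a", "col_b"])
    df = table.to_pandas()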

Most likely you will not get everything right the first time when processing your data. If the raw data is stored as CSV, save yourself debugging time and convert the CSV to parquet first, using e.g.:
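For example, a minimal sketch assuming pandas and pyarrow, streaming the CSV in chunks so it never has to fit in RAM (file names and chunk size are illustrative):

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    writer = None
    # Stream the CSV in chunks so the full file never has to fit in memory.
    for chunk in pd.read_csv("raw.csv", chunksize=1_000_000):
        table = pa.Table.from_pandas(chunk, preserve_index=False)
        if writer is None:
            writer = pq.ParquetWriter("raw.parquet", table.schema)
        writer.write_table(table)  # each chunk becomes at least one row group
    if writer is not None:
        writer.close()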

Depending on exact requirements, I'd look at:

- RocksDB
- SQLite

Note that RocksDB will produce multiple files in a single directory rather than an individual file. Last I looked, RocksDB did not support secondary indexes, so you are stuck with whatever choice you make for the key unless you want to rewrite the data. The RocksDB project does not provide Python bindings itself, but there are a few floating around on GitHub.
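As a sketch of what key/value access looks like, assuming the third-party python-rocksdb binding (other bindings on GitHub expose a similar but not identical API; the path, key format and row payload are made up):

    import pickle
    import rocksdb  # third-party binding, not shipped by the RocksDB project

    db = rocksdb.DB("rows.db", rocksdb.Options(create_if_missing=True))

    # Zero-padded row numbers as keys, so lexicographic key order matches numeric order.
    db.put(b"row:000000000042", pickle.dumps({"col_a": 1.0, "col_b": "x"}))

    row = pickle.loads(db.get(b"row:000000000042"))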

SQLite, at least for the initial load, might be pretty slow (I would recommend loading the data first and only creating an index on the row number after the initial load). But it allows you to create secondary indices and to look up multiple rows at a time by those indices reasonably efficiently.
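A minimal sketch with the standard-library sqlite3 module (table and column names are illustrative); the index on the row number is what makes single-row lookups cheap once the bulk load is done:

    import sqlite3

    con = sqlite3.connect("rows.sqlite")
    con.execute("CREATE TABLE IF NOT EXISTS rows (row_num INTEGER, col_a REAL, col_b TEXT)")

    # Bulk-load first (e.g. with executemany), then build the index afterwards,
    # since maintaining it during the load slows the initial insert down.
    con.execute("CREATE INDEX IF NOT EXISTS idx_row_num ON rows (row_num)")

    # Fetch a handful of specific rows by row number.
    cur = con.execute("SELECT * FROM rows WHERE row_num IN (?, ?, ?)", (10, 5000, 1234567))
    for row in cur:
        print(row)
    con.close()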
