Most efficient file type to read in only specific rows, in python (very large files)

I am currently using parquet files due to their outstanding read-in time. However, now I am looking to change the functionality of my program slightly. The files will become too large for memory, and instead I wish to read in only specific rows of the files.

The files have around 15 GB of data each (and I will be using multiple files), with several hundred columns and millions of rows. If I wanted to read in, for example, only row x, operate on it, and then read in a new row (millions of times over), what would be the most efficient file type for doing this?

I am not too concerned about compression, as RAM is my limiting factor rather than storage.

Thanks in advance for your expertise!
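For reference, parquet already supports reading less than a whole file: with pyarrow you can read individual row groups (and individual columns) instead of materialising the full table. A minimal sketch, assuming the file was written with reasonably small row groups; the file and column names are illustrative:

    import pyarrow.parquet as pq

    pf = pq.ParquetFile("data.parquet")
    print(pf.num_row_groups)  # parquet files are split into row groups

    # Read only the third row group, and only two columns, into memory
    # instead of loading the whole 15 GB file.
    table = pf.read_row_group(2, columns=["col_a", "col_b"])
    df = table.to_pandas()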

Most likely you will not get everything right the first time when processing your data. If the raw data is stored as CSV, save yourself debugging time and convert the CSV to parquet first, using e.g.:
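For example, a minimal sketch assuming pandas and pyarrow, streaming the CSV in chunks so it never has to fit in RAM (file names and chunk size are illustrative):

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    writer = None
    # Stream the CSV in chunks so the full file never has to fit in memory.
    for chunk in pd.read_csv("raw.csv", chunksize=1_000_000):
        table = pa.Table.from_pandas(chunk, preserve_index=False)
        if writer is None:
            writer = pq.ParquetWriter("raw.parquet", table.schema)
        writer.write_table(table)  # each chunk becomes at least one row group
    if writer is not None:
        writer.close()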

Depending on exact requirements, I'd look at:

- RocksDB
- SQLite

Note that RocksDB will produce multiple files in a single directory rather than an individual file. Last I looked, RocksDB did not support secondary indexes, so you are stuck with whatever choice you make for the key unless you want to rewrite the data. The RocksDB project does not provide Python bindings itself, but there are a few floating around on GitHub.
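As a sketch of what key/value access looks like, assuming the third-party python-rocksdb binding (other bindings on GitHub expose a similar but not identical API; the path, key format and row payload are made up):

    import pickle
    import rocksdb  # third-party binding, not shipped by the RocksDB project

    db = rocksdb.DB("rows.db", rocksdb.Options(create_if_missing=True))

    # Zero-padded row numbers as keys, so lexicographic key order matches numeric order.
    db.put(b"row:000000000042", pickle.dumps({"col_a": 1.0, "col_b": "x"}))

    row = pickle.loads(db.get(b"row:000000000042"))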

SQLite, at least for the initial load, might be pretty slow (I would recommend loading the data first and only creating an index on the row number after the initial load). But it allows you to create secondary indices and to look up multiple rows at a time by those indices reasonably efficiently.
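A minimal sketch with the standard-library sqlite3 module (table and column names are illustrative); the index on the row number is what makes single-row lookups cheap once the bulk load is done:

    import sqlite3

    con = sqlite3.connect("rows.sqlite")
    con.execute("CREATE TABLE IF NOT EXISTS rows (row_num INTEGER, col_a REAL, col_b TEXT)")

    # Bulk-load first (e.g. with executemany), then build the index afterwards,
    # since maintaining it during the load slows the initial insert down.
    con.execute("CREATE INDEX IF NOT EXISTS idx_row_num ON rows (row_num)")

    # Fetch a handful of specific rows by row number.
    cur = con.execute("SELECT * FROM rows WHERE row_num IN (?, ?, ?)", (10, 5000, 1234567))
    for row in cur:
        print(row)
    con.close()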
