
Python: Read multiple large CSVs at the same time

I have 9 large CSVs (12GB each) with exactly the same column structure and row order, just different values in each file. I need to go through the CSVs row by row and compare the data inside them, but they are far too large to hold in memory. Maintaining row order is very important, because the row position is used as the index for comparing data between the CSVs, so appending the tables together isn't ideal.

I'd rather avoid 9 nested "with open() as csv:" blocks using DictReader, as this seems very messy.

I've tried to use pandas and concatenate:

files = [list_of_csv_paths]
result = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)

but it simply tries to load all the data into memory, and I don't have nearly enough RAM. Changing pd.read_csv to use a specific chunksize raises a TypeError.

I've seen that Dask could possibly be used for this, but I have no experience with Dask.

I'm open to any suggestions.

I think reading in chunks might be a good start. According to the pandas documentation, chunksize is the number of rows per chunk. This is usually the most practical way to read huge files with pandas. You can also try threading to process the chunks faster.

Simple example:

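import pandas as pd

filename = 'file.csv'  # placeholder path
chunksize = 10 ** 6    # rows per chunk; tune this to the available memory

# With chunksize set, read_csv returns an iterator of DataFrames
for chunk in pd.read_csv(filename, chunksize=chunksize):
    process(chunk)     # process() is a placeholder for your own logic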
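
For the nine-file case in the question, the same chunked reading can be run over all files in lockstep. The following is only a minimal sketch, assuming every file has the same number of rows and the same chunksize is used for each reader, so that the i-th chunk of every file covers the same row positions. The function name iter_aligned_chunks and the small demo files it writes are hypothetical and only there for illustration.

```python
import os
import tempfile

import pandas as pd


def iter_aligned_chunks(paths, chunksize):
    """Yield tuples of DataFrames, one per file, covering the same row range."""
    # With chunksize set, read_csv returns an iterator instead of a DataFrame,
    # so only one chunk per file is held in memory at a time.
    readers = [pd.read_csv(p, chunksize=chunksize) for p in paths]
    try:
        # zip advances all readers together and stops when the shortest file ends.
        for chunks in zip(*readers):
            yield chunks
    finally:
        for reader in readers:
            reader.close()


# Tiny demo: two small files stand in for the large CSVs.
tmpdir = tempfile.mkdtemp()
paths = []
for name, offset in [("a.csv", 0), ("b.csv", 10)]:
    path = os.path.join(tmpdir, name)
    pd.DataFrame({"x": [v + offset for v in range(5)]}).to_csv(path, index=False)
    paths.append(path)

diffs = []
for chunk_a, chunk_b in iter_aligned_chunks(paths, chunksize=2):
    # Chunks keep their original row positions as the index, so rows line up.
    diffs.extend((chunk_b["x"] - chunk_a["x"]).tolist())

print(diffs)  # [10, 10, 10, 10, 10]
```

Memory use here is roughly chunksize rows times the number of files, so with nine files the chunksize should be chosen with that multiplier in mind.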

Check the skiprows parameter as well. The next example skips the first 1000 lines and then reads in chunks of 1000 lines, so the first chunk it yields covers lines 1000 to 2000.

Example:

# returns an iterator of chunks, not a single DataFrame
reader = pd.read_csv('file.csv', sep=',', header=None, skiprows=1000, chunksize=1000)
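
If a plain row-by-row comparison is enough, the standard library may also be able to avoid the nine nested "with open()" blocks mentioned in the question. This is a sketch of a different technique from the pandas chunking shown earlier: it uses contextlib.ExitStack to open all files in one block and zip to step through the csv.DictReader objects together. The function name iter_rows_in_lockstep and the demo data are hypothetical.

```python
import csv
import os
import tempfile
from contextlib import ExitStack


def iter_rows_in_lockstep(paths):
    """Yield one tuple per row position, holding that row from every file."""
    with ExitStack() as stack:
        # ExitStack closes every opened file on exit, replacing nested with blocks.
        files = [stack.enter_context(open(p, newline="")) for p in paths]
        readers = [csv.DictReader(f) for f in files]
        # zip pulls one row from each reader at a time and stops at the shortest file.
        yield from zip(*readers)


# Tiny demo: two small files that differ only in the second data row.
tmpdir = tempfile.mkdtemp()
paths = []
for name, values in [("a.csv", [1, 2, 3]), ("b.csv", [1, 5, 3])]:
    path = os.path.join(tmpdir, name)
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["value"])
        writer.writerows([[v] for v in values])
    paths.append(path)

# Collect the row positions where the files disagree.
mismatched = [
    i
    for i, rows in enumerate(iter_rows_in_lockstep(paths))
    if len({row["value"] for row in rows}) > 1
]

print(mismatched)  # [1]
```

Only one row per file is in memory at any time, at the cost of slower pure-Python parsing compared with pandas.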


 