Most efficient way to read a specific column in a large CSV file
There is a CSV file of approx. 2.5 GB with about 50 columns and 4.5 million rows.
The dataset will be used for different operations, but only a few columns are used at a time, so I am looking for a high-performance way to read just one column of a CSV file.
Reading the whole file in one go into a Pandas DataFrame takes roughly 38 seconds:
path = r"C:\my_path\my_csv.csv"
pd.read_csv(path, header=0)
Reading only one specific column takes about 14 seconds:
pd.read_csv(path, usecols=["my_specific_col"], header=0)
Is there a way to reduce the reading time? It seems that the number of columns has little effect on performance.
Since version 1.4.0 of Pandas, there is a new experimental engine for read_csv, relying on the Arrow library's multithreaded CSV parser instead of the default C parser.
So, this might help to speed things up:
df = pd.read_csv(path, usecols=["my_specific_col"], header=0, engine="pyarrow")