Most efficient way to read a specific column in a large CSV file
There is a CSV file of approx. 2.5 GB with about 50 columns and 4.5 million rows.
The dataset will be used for different operations, but only a few columns are used at a time, so I am looking for a high-performance way to read just one column of a CSV file.
Reading the whole file in one go into a Pandas DataFrame takes roughly 38 seconds:
path = r"C:\my_path\my_csv.csv"
pd.read_csv(path, header=0)
Reading only one specific column takes about 14 seconds:
pd.read_csv(path, usecols=["my_specific_col"], header=0)
Is there a way to reduce the reading time? It seems that the number of columns has little effect on performance.
Since version 1.4.0 of Pandas, there is a new experimental engine for read_csv, relying on the Arrow library's multithreaded CSV parser instead of the default C parser.
So, this might help to speed things up:
df = pd.read_csv(path, usecols=["my_specific_col"], header=0, engine="pyarrow")