Most efficient way to read a huge parquet file into memory in Python

Ideally, I would like to have the data in a dictionary, although I am not even sure whether a dictionary is better than a DataFrame in this context. After a bit of research, I found the following ways to read a Parquet file into memory:

  • Pyarrow (Python API of Apache Arrow):

With pyarrow, I can read a Parquet file into a pyarrow.Table. I can also read the data into a pyarrow.DictionaryArray. Both are easily convertible into a DataFrame, but wouldn't memory consumption double in that case?

  • Pandas:

Via pd.read_parquet, the file is read into a DataFrame. Again, would a DataFrame perform as well as a dictionary?

  • parquet-python (pure Python, read-only):

Supports reading each row of a Parquet file as a dictionary. That means I'd have to merge a lot of tiny dictionaries, and I am not sure whether that is wise.
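To make the merging step concrete, here is a rough sketch in plain Python. It does not use parquet-python itself: `rows` is a hand-written stand-in for whatever iterable of per-row dictionaries the reader yields, and `merge_rows` is a hypothetical helper. It assumes every row has the same keys.

```python
from collections import defaultdict


def merge_rows(rows):
    """Merge an iterable of per-row dicts into one dict of column lists."""
    columns = defaultdict(list)
    for row in rows:
        for key, value in row.items():
            columns[key].append(value)
    return dict(columns)


# Stand-in for rows produced by a row-oriented Parquet reader.
rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
merged = merge_rows(rows)
```

Every value passes through the Python interpreter one at a time here, so for a huge file this is likely to be much slower than a columnar read.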

An efficient way to read a huge Parquet file into memory in Python is to use the pyarrow library, which provides high-performance, memory-efficient data structures for working with Parquet files.

import pyarrow.parquet as pq

# Placeholder path: replace with the location of your Parquet file
path = "data.parquet"

# Read the Parquet file into a pandas DataFrame
df = pq.read_pandas(path).to_pandas()

# Convert the DataFrame to a NumPy array
data = df.to_numpy()

