Most efficient way to read a huge Parquet file into memory in Python

Question

Ideally, I would like to have the data in a dictionary. I am not even sure if a dictionary is better than a dataframe in this context. After a bit of research, I found the following ways to read a parquet file into memory:

PyArrow (Python API of Apache Arrow):

With pyarrow, I can read a parquet file into a pyarrow.Table. I can also read the data into a pyarrow.DictionaryArray. Both are easily convertible into a dataframe, but wouldn't memory consumption double in this case?

Pandas:

Via pd.read_parquet. The file is read into a dataframe. Again, would a dataframe perform as well as a dictionary?

parquet-python (pure python, supports read-only):

Supports reading each row in a parquet file as a dictionary. That means I'd have to merge a lot of tiny dictionaries. I am not sure if this is wise.
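To make the merging step concrete, here is a pure-Python sketch. The rows list is a hypothetical stand-in for whatever a row-oriented reader yields; it does not call parquet-python itself. It folds the per-row dictionaries into a single dictionary of columns.

```python
from collections import defaultdict

# Hypothetical stand-in for rows produced one at a time by a row-oriented reader.
rows = [
    {"id": 1, "value": 0.5},
    {"id": 2, "value": 1.5},
    {"id": 3, "value": 2.5},
]

# Merge the per-row dicts into one dict mapping column name -> list of values.
columns = defaultdict(list)
for row in rows:
    for key, val in row.items():
        columns[key].append(val)
columns = dict(columns)

print(columns)
```

This works, but every value passes through the Python interpreter one at a time, so on a huge file it is likely to be much slower than a columnar reader.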

Answer 1

One efficient way to read a huge Parquet file into memory in Python is to use the pyarrow library, which provides high-performance, memory-efficient data structures for working with Parquet files.

import pyarrow.parquet as pq

# Placeholder: set this to the location of your Parquet file
path = "data.parquet"

# Read the Parquet file into a pandas DataFrame
df = pq.read_pandas(path).to_pandas()

# Convert the DataFrame to a NumPy array
data = df.to_numpy()

Answer 1
0 2023-01-02 11:02:36
