
Most efficient way to read a huge parquet file into memory in Python

Ideally, I would like to have the data in a dictionary. I am not even sure if a dictionary is better than a dataframe in this context. After a bit of research, I found the following ways to read a parquet file into memory:

  • Pyarrow (Python API of Apache Arrow):

With pyarrow, I can read a parquet file into a pyarrow.Table. I can also read the data into a pyarrow.DictionaryArray. Both are easily convertible into a dataframe, but wouldn't memory consumption double in this case?

  • Pandas:

Via pd.read_parquet. The file is read into a dataframe. Again, would a dataframe perform as well as a dictionary?

  • parquet-python (pure Python, read-only support):

Supports reading in each row in a parquet as a dictionary. That means I'd have to merge a lot of nano-dictionaries. I am not sure if this is wise.
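Merging those per-row "nano-dictionaries" into one columnar dictionary can be done in a single pass with plain Python. The `rows` list below stands in for the row iterator such a reader would yield; it is illustrative, not parquet-python's actual API:

```python
from collections import defaultdict

# Hypothetical stream of per-row dicts, as a row-oriented reader yields them
rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]

# Merge row dicts into one column-oriented dict in a single pass
columns = defaultdict(list)
for row in rows:
    for key, value in row.items():
        columns[key].append(value)

print(dict(columns))  # {'id': [1, 2], 'name': ['a', 'b']}
```

This works, but every cell becomes a Python object along the way, which is exactly the overhead the columnar readers above avoid.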

The most efficient way to read a huge parquet file into memory in Python is to use the pyarrow library, which provides high-performance, memory-efficient data structures for working with Parquet files.

import pyarrow.parquet as pq

# Read the Parquet file into a pandas DataFrame
df = pq.read_table(path).to_pandas()

# Convert the DataFrame to a NumPy array
# (this makes a copy when the columns have mixed dtypes)
data = df.values

