Most efficient way to read a huge parquet file into memory in Python
Ideally, I would like to have the data in a dictionary. I am not even sure if a dictionary is better than a dataframe in this context. After a bit of research, I found the following ways to read a parquet file into memory:
With pyarrow, I can read a parquet file into a pyarrow.Table. I can also read the data into a pyarrow.DictionaryArray. Both are easily convertible into a dataframe, but wouldn't memory consumption double in this case?
Via pd.read_parquet. The file is read into a dataframe. Again, would a dataframe perform as well as a dictionary?
A third option supports reading each row of a parquet file as a dictionary. That means I'd have to merge a lot of tiny dictionaries. I am not sure if this is wise.
An efficient option to consider for reading a huge parquet file into memory in Python is the pyarrow library, which provides high-performance, memory-efficient data structures for working with Parquet files.
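import pyarrow.parquet as pq

# Placeholder path: point this at your own Parquet file
path = "data.parquet"
# Read the Parquet file into a pandas DataFrame
df = pq.read_pandas(path).to_pandas()
# Convert the DataFrame to a NumPy array (mixed column types become dtype=object)
data = df.to_numpy()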