
How to read a Parquet file into Pandas DataFrame?

How can I read a modestly sized Parquet dataset into an in-memory Pandas DataFrame without setting up cluster computing infrastructure such as Hadoop or Spark? This is only a moderate amount of data that I would like to read in memory with a simple Python script on a laptop. The data does not reside on HDFS; it is either on the local file system or possibly in S3. I do not want to spin up and configure other services like Hadoop, Hive or Spark.

I thought Blaze/Odo would have made this possible: the Odo documentation mentions Parquet, but the examples all seem to go through an external Hive runtime.

pandas 0.21 introduces new functions for Parquet:

pd.read_parquet('example_pa.parquet', engine='pyarrow')

or

pd.read_parquet('example_fp.parquet', engine='fastparquet')

The above link explains:

These engines are very similar and should read/write nearly identical parquet format files. These libraries differ by having different underlying dependencies (fastparquet uses numba, while pyarrow uses a C library).
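
As a minimal sketch (the file name here is a placeholder), you can also leave the choice to pandas, which tries pyarrow first and falls back to fastparquet:

import pandas as pd

# engine='auto' (the default) tries pyarrow first and falls back
# to fastparquet if pyarrow is not installed
df = pd.read_parquet('example.parquet', engine='auto')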

Update: since the time I answered this, there has been a lot of work on Apache Arrow for better reading and writing of parquet. Also: http://wesmckinney.com/blog/python-parquet-multithreading/
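
As a sketch of what that work enables (the file path is a placeholder), pyarrow reads a file with multiple threads by default:

import pyarrow.parquet as pq

# use_threads=True (the default) parallelizes column decoding,
# which is the multithreaded reading discussed in the post above
table = pq.read_table('example.parquet', use_threads=True)
df = table.to_pandas()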

There is a python parquet reader that works relatively well: https://github.com/jcrobak/parquet-python

It will create Python objects which you then have to move into a Pandas DataFrame, so the process will be slower than pd.read_csv, for example.

Aside from pandas, Apache pyarrow also provides a way to transform parquet into a dataframe.

The code is simple, just type:

import pyarrow.parquet as pq

df = pq.read_table(source=your_file_path).to_pandas()

For more information, see the Apache pyarrow documentation, Reading and Writing Single Files.
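
Since the question also mentions S3: assuming the s3fs package is installed (the bucket and key below are placeholders), pandas can read Parquet straight from S3:

import pandas as pd

# pandas hands 's3://' URLs to s3fs/fsspec under the hood;
# the bucket and key here are hypothetical
df = pd.read_parquet('s3://my-bucket/path/to/data.parquet')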

Parquet files are often large, so consider reading them with dask.

import dask.dataframe as dd
from dask import delayed
from fastparquet import ParquetFile
import glob

files = glob.glob('data/*.parquet')

@delayed
def load_chunk(path):
    # read a single parquet file into a pandas DataFrame, lazily
    return ParquetFile(path).to_pandas()

# build a dask DataFrame from the per-file delayed loads
df = dd.from_delayed([load_chunk(f) for f in files])

# materialize everything as a single in-memory pandas DataFrame
pdf = df.compute()
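
A simpler alternative, assuming the same data/ directory of files: dask can read a directory of Parquet files directly, without the delayed wrapper:

import dask.dataframe as dd

# dask discovers and reads every file matching the glob pattern
df = dd.read_parquet('data/*.parquet')
pdf = df.compute()  # convert to an in-memory pandas DataFrame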

When writing to parquet, consider using brotli compression. I'm getting a 70% size reduction on an 8 GB parquet file by using brotli compression. Brotli makes for a smaller file and faster reads/writes than gzip, snappy, or pickle, although pickle can handle tuples whereas parquet cannot.

df.to_parquet('df.parquet.brotli', compression='brotli')
df = pd.read_parquet('df.parquet.brotli')
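
To verify that claim on your own data, here is a small sketch (the DataFrame and file names are throwaways, and brotli support depends on how your engine was built) that writes the same frame with several codecs and compares on-disk sizes:

import os
import pandas as pd

df = pd.DataFrame({'a': range(100_000), 'b': ['some text'] * 100_000})

# write the same frame with each codec and compare file sizes
for codec in ['snappy', 'gzip', 'brotli']:
    path = f'df.parquet.{codec}'
    df.to_parquet(path, compression=codec)
    print(codec, os.path.getsize(path), 'bytes')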

Considering the .parquet file named data.parquet:

parquet_file = '../data.parquet'

Then use pandas.DataFrame.to_parquet to write it (this function requires either the fastparquet or pyarrow library). There is no need to open() the file beforehand; to_parquet creates it:

parquet_df.to_parquet(parquet_file)

Then, use pandas.read_parquet() to get a dataframe:

new_parquet_df = pd.read_parquet(parquet_file)

Parquet

Step 1: Data to play with

df = pd.DataFrame({
    'student': ['personA007', 'personB', 'x', 'personD', 'personE'],
    'marks': [20,10,22,21,22],
})

Step 2: Save as Parquet

df.to_parquet('sample.parquet')

Step 3: Read from Parquet

df = pd.read_parquet('sample.parquet')
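
Because Parquet is a columnar format, you can also read just the columns you need from the file written above:

# selected columns are read without scanning the rest of the file
marks_only = pd.read_parquet('sample.parquet', columns=['marks'])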
