
How to read a Parquet file into Pandas DataFrame?

How can I read a modestly sized Parquet dataset into an in-memory Pandas DataFrame without setting up cluster computing infrastructure such as Hadoop or Spark? This is only a moderate amount of data that I would like to read in memory with a simple Python script on a laptop. The data does not reside on HDFS; it is either on the local file system or possibly in S3. I do not want to spin up and configure other services like Hadoop, Hive or Spark.

I thought Blaze/Odo would have made this possible: the Odo documentation mentions Parquet, but the examples all seem to go through an external Hive runtime.

pandas 0.21 introduces new functions for Parquet:

pd.read_parquet('example_pa.parquet', engine='pyarrow')

or

pd.read_parquet('example_fp.parquet', engine='fastparquet')

The above link explains:

These engines are very similar and should read/write nearly identical parquet format files. These libraries differ by having different underlying dependencies (fastparquet uses numba, while pyarrow uses a C library).
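
As a minimal sketch (the file name here is a placeholder), you can also leave the choice to pandas, which tries pyarrow first and falls back to fastparquet:

import pandas as pd

# engine='auto' (the default) tries pyarrow first and falls back
# to fastparquet if pyarrow is not installed
df = pd.read_parquet('example.parquet', engine='auto')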

Update: since the time I answered this, there has been a lot of work on Apache Arrow for better reading and writing of parquet. Also: http://wesmckinney.com/blog/python-parquet-multithreading/
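
As a sketch of what that work enables (the file path is a placeholder), pyarrow reads a file with multiple threads by default:

import pyarrow.parquet as pq

# use_threads=True (the default) parallelizes column decoding,
# which is the multithreaded reading discussed in the post above
table = pq.read_table('example.parquet', use_threads=True)
df = table.to_pandas()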

There is a python parquet reader that works relatively well: https://github.com/jcrobak/parquet-python

It will create Python objects which you then have to move into a Pandas DataFrame, so the process will be slower than pd.read_csv, for example.

Aside from pandas, Apache pyarrow also provides a way to transform parquet into a dataframe.

The code is simple, just type:

import pyarrow.parquet as pq

df = pq.read_table(source=your_file_path).to_pandas()

For more information, see the Apache pyarrow documentation, Reading and Writing Single Files.
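
Since the question also mentions S3: assuming the s3fs package is installed (the bucket and key below are placeholders), pandas can read Parquet straight from S3:

import pandas as pd

# pandas hands 's3://' URLs to s3fs/fsspec under the hood;
# the bucket and key here are hypothetical
df = pd.read_parquet('s3://my-bucket/path/to/data.parquet')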

Parquet files are often large, so consider reading them with dask.

import dask.dataframe as dd
from dask import delayed
from fastparquet import ParquetFile
import glob

files = glob.glob('data/*.parquet')

@delayed
def load_chunk(path):
    # read a single parquet file into a pandas DataFrame, lazily
    return ParquetFile(path).to_pandas()

# build a dask DataFrame from the per-file delayed loads
df = dd.from_delayed([load_chunk(f) for f in files])

# materialize everything as a single in-memory pandas DataFrame
pdf = df.compute()
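
A simpler alternative, assuming the same data/ directory of files: dask can read a directory of Parquet files directly, without the delayed wrapper:

import dask.dataframe as dd

# dask discovers and reads every file matching the glob pattern
df = dd.read_parquet('data/*.parquet')
pdf = df.compute()  # convert to an in-memory pandas DataFrame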

When writing to parquet, consider using brotli compression. I'm getting a 70% size reduction on an 8 GB parquet file by using brotli compression. Brotli makes for a smaller file and faster reads/writes than gzip, snappy, or pickle, although pickle can handle tuples whereas parquet cannot.

df.to_parquet('df.parquet.brotli', compression='brotli')
df = pd.read_parquet('df.parquet.brotli')
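
To verify that claim on your own data, here is a small sketch (the DataFrame and file names are throwaways, and brotli support depends on how your engine was built) that writes the same frame with several codecs and compares on-disk sizes:

import os
import pandas as pd

df = pd.DataFrame({'a': range(100_000), 'b': ['some text'] * 100_000})

# write the same frame with each codec and compare file sizes
for codec in ['snappy', 'gzip', 'brotli']:
    path = f'df.parquet.{codec}'
    df.to_parquet(path, compression=codec)
    print(codec, os.path.getsize(path), 'bytes')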

Considering the .parquet file named data.parquet:

parquet_file = '../data.parquet'

Then use pandas.DataFrame.to_parquet to write it (this function requires either the fastparquet or pyarrow library). There is no need to open() the file beforehand; to_parquet creates it:

parquet_df.to_parquet(parquet_file)

Then, use pandas.read_parquet() to get a dataframe:

new_parquet_df = pd.read_parquet(parquet_file)

Parquet

Step 1: Data to play with

df = pd.DataFrame({
    'student': ['personA007', 'personB', 'x', 'personD', 'personE'],
    'marks': [20,10,22,21,22],
})

Step 2: Save as Parquet

df.to_parquet('sample.parquet')

Step 3: Read from Parquet

df = pd.read_parquet('sample.parquet')
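
Because Parquet is a columnar format, you can also read just the columns you need from the file written above:

# selected columns are read without scanning the rest of the file
marks_only = pd.read_parquet('sample.parquet', columns=['marks'])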
