I'm having trouble finding a library that allows Parquet files to be written using Python. Bonus points if I can use Snappy or a similar compression mechanism in conjunction with it.
Thus far the only method I have found is using Spark with the pyspark.sql.DataFrame Parquet support.

I have some scripts that need to write Parquet files that are not Spark jobs. Is there any approach to writing Parquet files in Python that doesn't involve pyspark.sql?
Update (March 2017): There are currently two libraries capable of writing Parquet files: fastparquet and pyarrow.
Both of them still seem to be under heavy development and come with a number of disclaimers (e.g. no support for nested data), so you will have to check whether they support everything you need.
OLD ANSWER:
As of February 2016 there seems to be NO Python-only library capable of writing Parquet files.
If you only need to read Parquet files there is python-parquet.
As a workaround you will have to rely on some other process, e.g. pyspark.sql (which uses Py4J and runs on the JVM, and thus cannot be used directly from your average CPython program).
fastparquet does have write support; here is a snippet to write data to a file:
from fastparquet import write
write('outfile.parq', df)
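If you also want compression, fastparquet's write accepts a compression argument; a minimal sketch (the output file name is just an example, and the Snappy codec requires python-snappy to be installed):
from fastparquet import write
# write the same dataframe, compressing column chunks with Snappy
write('outfile.snappy.parq', df, compression='SNAPPY')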
using fastparquet you can write a pandas df to parquet either with snappy or gzip compression as follows:
make sure you have installed the following:
$ conda install python-snappy
$ conda install fastparquet
do imports
import pandas as pd
import snappy
import fastparquet
assume you have the following pandas df
df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})
send df to parquet with snappy compression
df.to_parquet('df.snap.parquet', compression='snappy')
send df to parquet with gzip compression
df.to_parquet('df.gzip.parquet', compression='gzip')
check:
read parquet back into pandas df
pd.read_parquet('df.snap.parquet')
or
pd.read_parquet('df.gzip.parquet')
output:
col1 col2
0 1 3
1 2 4
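Note that pandas will pick whichever Parquet engine happens to be installed; if you want to force fastparquet explicitly, to_parquet also takes an engine argument, for example:
# same write as above, but pinned to the fastparquet engine
df.to_parquet('df.snap.parquet', engine='fastparquet', compression='snappy')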
I've written a comprehensive guide to Python and Parquet with an emphasis on taking advantage of Parquet's three primary optimizations: columnar storage, columnar compression and data partitioning. There is a fourth optimization that isn't covered yet, row groups, but they aren't commonly used. The ways of working with Parquet in Python are pandas, PyArrow, fastparquet, PySpark, Dask and AWS Data Wrangler.
Check out the post here: Python and Parquet Performance In Pandas, PyArrow, fastparquet, AWS Data Wrangler, PySpark and Dask
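To illustrate the data-partitioning point, here is a minimal sketch using pyarrow's write_to_dataset (the column names and output path are made up for the example):
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'year': [2019, 2019, 2020], 'value': [1, 2, 3]})
table = pa.Table.from_pandas(df)
# writes one sub-directory per distinct 'year' value (Hive-style partitioning)
pq.write_to_dataset(table, root_path='dataset_root', partition_cols=['year'])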
Assuming df is the pandas dataframe, we need to import the following libraries.
import pyarrow as pa
import pyarrow.parquet as pq
First, write the dataframe df into a pyarrow table.
# Convert DataFrame to Apache Arrow Table
table = pa.Table.from_pandas(df)
Second, write the table into a parquet file, say file_name.parquet.
# Parquet with Snappy compression (the default)
pq.write_table(table, 'file_name.parquet')

# Parquet with GZIP compression
pq.write_table(table, 'file_name.parquet', compression='GZIP')

# Parquet with Brotli compression
pq.write_table(table, 'file_name.parquet', compression='BROTLI')
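To check the result, the file can be read back through pyarrow and converted to pandas, for example:
# read the parquet file back into a pandas dataframe
df_restored = pq.read_table('file_name.parquet').to_pandas()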
Reference: https://tech.jda.com/efficient-dataframe-storage-with-apache-parquet/
pyspark seems to be the best alternative right now for writing out parquet with Python. It may seem like using a sword in place of a needle, but that's how it is at the moment.

Simply do pip install pyspark and you are good to go.
https://spark.apache.org/docs/latest/sql-data-sources-parquet.html
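For instance, a minimal sketch of writing a Parquet file with pyspark (the app name, sample data and output path are only illustrative):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('parquet-write').getOrCreate()
df = spark.createDataFrame([(1, 3), (2, 4)], ['col1', 'col2'])
# Snappy is the default codec; the output is a directory of part files
df.write.mode('overwrite').parquet('outdir.parquet')
spark.stop()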
Two more Python libraries for fast CSV => parquet transformations:
They may not have all the bells and whistles of fastparquet, but they are really fast and easy to master.
Edit: Polars can write Parquet using Arrow, which supports new Parquet versions and options: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html
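As a sketch of the CSV => Parquet case with Polars (file names here are only placeholders):
import polars as pl
# read a CSV and write it straight back out as Snappy-compressed Parquet
pl.read_csv('input.csv').write_parquet('output.parquet', compression='snappy')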