
Which is the best way to read Parquet files for processing as a Dask DataFrame?

I have a directory with 600 small Parquet files. I want to do ETL on them and merge them into files of roughly 128 MB each. What is the optimal way to process the data?

Should I read each file in the Parquet directory individually, concatenate them into a single DataFrame, and then do the groupby? Or should I pass the directory name to dd.read_parquet and process it that way?
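For reference, here is a minimal sketch of the second approach (passing the whole directory to dd.read_parquet and repartitioning before writing). The paths and column names are placeholders, and it assumes all files share the same schema:

```python
import dask.dataframe as dd

# Read every Parquet file in the directory as one lazy Dask DataFrame;
# Dask creates partitions from the files without loading them all into memory.
df = dd.read_parquet("input_parquet_dir/")  # placeholder path

# Example transformation: a groupby aggregation (placeholder column names).
result = df.groupby("key_column").agg({"value_column": "sum"}).reset_index()

# Repartition so each output file is roughly 128 MB, then write back out.
result = result.repartition(partition_size="128MB")
result.to_parquet("output_parquet_dir/", write_index=False)
```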

I have noticed that when I read the files one by one, it creates a very large Dask task graph that cannot even be rendered as an image. I assume it also spawns a correspondingly large number of tasks/threads, which leads to a memory error.

Which way is best to read the Parquet files for processing as a Dask DataFrame: file by file, or by providing the entire directory?

Unfortunately, there is no single best way to read a Parquet file that covers all situations. To answer the question properly, more details about your specific situation are needed.
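For example, the total data size, the memory available per worker, and how many partitions the reader produces all affect which approach is reasonable. A small diagnostic sketch (paths are placeholders) for inspecting those numbers before deciding:

```python
import dask.dataframe as dd

df = dd.read_parquet("input_parquet_dir/")  # placeholder path

# How many partitions (and therefore, roughly, how many tasks per operation)
# the reader created from the directory.
print("partitions:", df.npartitions)

# Approximate in-memory size of each partition, to judge whether
# repartitioning is needed before writing the merged files.
sizes = df.map_partitions(lambda part: part.memory_usage(deep=True).sum()).compute()
print("mean partition size (MB):", sizes.mean() / 1e6)
```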
